“You haven’t told me yet,” said Lady Nuttal, “what it is your
fiancé does for a living.”

“He’s a statistician,” replied Lamia, with an annoying sense
of being on the defensive.

Lady Nuttal was obviously taken aback. It had not occurred
to her that statisticians entered into normal social relationships.
The species, she would have surmised, was perpetuated in some
collateral manner, like mules.

“But Aunt Sara, it’s a very interesting profession,” said Lamia
warmly.

“I don’t doubt it,” said her aunt, who obviously doubted it
very much. “To express anything important in mere figures is so
plainly impossible that there must be endless scope for well-paid
advice on how to do it. But don’t you think that life with a
statistician would be rather, shall we say, humdrum?”

Lamia was silent. She felt reluctant to discuss the surprising
depth of emotional possibility which she had discovered below
Edward’s numerical veneer.

“It’s not the figures themselves,” she said finally, “it’s what
you do with them that matters.”


K. A. C. MANDERVILLE, The Undoing of Lamia Gurdleneck

PREFACE TO VOLUME TWO


We present herewith the second volume of this treatise on the advanced theory of 
statistics. It covers the theory of estimation and testing hypotheses, statistical relation- 
ship, distribution-free methods and sequential analysis. The third and concluding 
volume will comprise the design and analysis of sample surveys and experiments, 
including variance analysis, and the theory of multivariate analysis and time-series. 

This volume bears very little resemblance to the original Volume 2 of Kendall’s 
Advanced Theory. It has had to be planned and written practically ab initio, owing 
to the rapid development of the subject over the past fifteen years. A glance at the 
references will show how many of them have been published within that period. 

As with the first volume, we have tried to make this volume self-contained in three 
respects: it lists its own references, it repeats the relevant tables given in the Appendices 
to Volume 1, and it has its own index. The necessity for taking up a lot of space with 
an extensive bibliography is being removed by the separate publication of Kendall 
and Doig’s comprehensive Bibliography of Statistical Literature. We have made a
special effort to provide a good set of exercises: there are about 400 in this volume. 

For permission to quote some of the tables at the end of the book we are indebted 
to Professor Sir Ronald Fisher, Dr. Frank Yates, Messrs. Oliver and Boyd, and the 
editors of Biometrika. Mr. E. V. Burke of Charles Griffin and Company Limited has 
given his usual invaluable help in seeing the book through the press. We are also 
indebted to Mr. K. A. C. Manderville for permission to quote from an unpublished 
story the extract given on page v. 

As always, we shall be glad to be notified of any errors, misprints or obscurities. 


M. G. K. 
A. S.
LONDON, 
March, 1961 


PREFACE TO SECOND EDITION 


Apart from the correction of minor misprints and errors, this volume has been 
generally revised for this edition. In a few places, the progress of the subject has 
made it necessary to re-write a page or two, although commonly only paragraphs or 
sentences have required alteration. Some hundreds of new references have been 
added in the course of the revision, together with a considerable number of new exercises, 
many of them from recent research. 

We are grateful to readers all over the world who have helped us to make these 
improvements. 


M. G. K. 
A. S.
LONDON,
January, 1967
CHAPTER 17


ESTIMATION 


The problem 


17.1 On several occasions in previous chapters we have encountered the problem 
of estimating from a sample the values of the parameters of the parent population. 
We have hitherto dealt on somewhat intuitive lines with such questions as arose—for 
example, in the theory of large samples we have taken the means and moments of the 
sample to be satisfactory estimates of the corresponding means and moments in the 
parent. 

We now proceed to study this branch of the subject in more detail. In the present 
chapter, we shall examine the sort of criteria which we require a “good” estimate
to satisfy, and discuss the question whether there exist “best” estimates in an
acceptable sense of the term. In the next few chapters, we shall consider methods of obtaining
estimates with the required properties. 


17.2 It will be evident that if a sample is not random and nothing precise is known 
about the nature of the bias operating when it was chosen, very little can be inferred 
from it about the parent population. Certain conclusions of a trivial kind are some- 
times possible—for instance, if we take ten turnips from a pile of 100 and find that 
they weigh ten pounds altogether, the mean weight of turnips in the pile must be greater 
than one-tenth of a pound; but such information is rarely of value, and estimation 
based on biassed samples remains very much a matter of individual opinion and cannot 
be reduced to exact and objective terms. We shall therefore confine our attention to 
random samples only. Our general problem, in its simplest terms, is then to estimate 
the value of a parameter in the parent from the information given by the sample. In 
the first instance we consider the case when only one parameter is to be estimated. 
The case of several parameters will be discussed later. 


17.3 Let us in the first place consider what we mean by “estimation.” We know,
or assume as a working hypothesis, that the parent population is distributed in a form
which is completely determinate but for the value of some parameter $\theta$. We are given
a sample of observations $x_1, \ldots, x_n$. We require to determine, with the aid of
observations, a number which can be taken to be the value of $\theta$, or a range of numbers which
can be taken to include that value.

Now the observations are random variables, and any function of the observations 
will also be a random variable. A function of the observations alone is called a statistic. 
If we use a statistic to estimate $\theta$, it may on occasion differ considerably from the true
value of $\theta$. It appears, therefore, that we cannot expect to find any method of
estimation which can be guaranteed to give us a close estimate of $\theta$ on every occasion and
for every sample. We must content ourselves with formulating a rule which will give
good results “in the long run” or “on the average,” or which has “a high
probability of success”: phrases which express the fundamental fact that we have to regard
our method of estimation as generating a distribution of estimates and to assess its
merits according to the properties of this distribution.




17.4 It will clarify our ideas if we draw a distinction between the method or rule
of estimation, which we shall call an estimator, and the value to which it gives rise
in particular cases, the estimate. The distinction is the same as that between a
function $f(x)$, regarded as defined for a range of the variable $x$, and the particular value
which the function assumes, say $f(a)$, for a specified value of $x$ equal to $a$. Our problem
is not to find estimates, but to find estimators. We do not reject an estimator because
it gives a bad result in a particular case (in the sense that the estimate differs materially
from the true value). We should only reject it if it gave bad results in the long run,
that is to say, if the distribution of possible values of the estimator were seriously
discrepant with the true value of $\theta$. The merit of the estimator is judged by the
distribution of estimates to which it gives rise, i.e. by the properties of its sampling distribution.


17.5 In the theory of large samples, we have often taken as an estimator of a
parameter $\theta$ a statistic $t$ calculated from the sample in exactly the same way as $\theta$ is
calculated from the population: e.g. the sample mean is taken as an estimator of the parent
mean. Let us examine how this procedure can be justified. Consider the case when
the parent population is

$$dF(x) = (2\pi)^{-1/2} \exp\{-\tfrac{1}{2}(x-\theta)^2\}\,dx, \qquad -\infty < x < \infty. \qquad (17.1)$$

Requiring an estimator for the parent mean $\theta$, we take

$$t = \frac{1}{n}\sum_{j=1}^{n} x_j. \qquad (17.2)$$

The distribution of $t$ is (Example 11.12)

$$dF(t) = \{n/(2\pi)\}^{1/2} \exp\{-\tfrac{1}{2}n(t-\theta)^2\}\,dt, \qquad (17.3)$$

that is to say, $t$ is distributed normally about $\theta$ with variance $1/n$. We notice two
things about this distribution: (a) it has a mean (and median and mode) at the true
value $\theta$, and (b) as $n$ increases, the scatter of possible values of $t$ about $\theta$ becomes smaller,
so that the probability that a given $t$ differs by more than a fixed amount from $\theta$
decreases. We may say that the accuracy of the estimator increases as $n$ increases, i.e.
with $n$.


17.6 Generally, it will be clear that the phrase “accuracy increasing with $n$” has
a definite meaning whenever the sampling distribution of $t$ has a variance which
decreases with $1/n$ and a central value which is either identical with $\theta$ or differs from it
by a quantity which also decreases with $1/n$. Many of the estimators with which we
are commonly concerned are of this type, but there are exceptions. Consider, for
example, the Cauchy population

$$dF(x) = \frac{1}{\pi}\,\frac{dx}{\{1+(x-\theta)^2\}}, \qquad -\infty \le x \le \infty. \qquad (17.4)$$

If we estimate $\theta$ by the mean-statistic $t$ we have, for the distribution of $t$,

$$dF(t) = \frac{1}{\pi}\,\frac{dt}{\{1+(t-\theta)^2\}} \qquad (17.5)$$

(cf. Example 11.1). In this case the distribution of $t$ is the same as that of any single
value of the sample, and does not increase in accuracy as $n$ increases.


Consistency 

17.7 The possession of the property of increasing accuracy is evidently a very
desirable one; and indeed, if the variance of the sampling distribution of an estimator
decreases with increasing $n$, it is necessary that its central value should tend to $\theta$, for
otherwise the estimator would have values differing systematically from the true value.
We therefore formulate our first criterion for a suitable estimator as follows:

An estimator $t_n$, computed from a sample of $n$ values, will be said to be a consistent
estimator of $\theta$ if, for any positive $\varepsilon$ and $\eta$, however small, there is some $N$ such that
the probability that

$$|t_n - \theta| < \varepsilon \qquad (17.6)$$

is greater than $1-\eta$ for all $n > N$. In the notation of the theory of probability,

$$P\{|t_n - \theta| < \varepsilon\} > 1 - \eta, \qquad n > N. \qquad (17.7)$$

The definition bears an obvious analogy to the definition of convergence in the
mathematical sense. Given any fixed small quantity $\varepsilon$, we can find a large enough
sample number such that, for all samples over that size, the probability that $t$ differs
from the true value by more than $\varepsilon$ is as near zero as we please. $t_n$ is said to converge
in probability, or to converge stochastically, to $\theta$. Thus $t$ is a consistent estimator of $\theta$
if it converges to $\theta$ in probability.


Example 17.1 

The sample mean is a consistent estimator of the parameter $\theta$ in the population
(17.1). This we have already established in general argument, but more formally the
proof would proceed as follows:

Suppose we are given $\varepsilon$. From (17.3) we see that $(t-\theta)n^{1/2}$ is distributed normally
about zero with unit variance. Thus the probability that $|(t-\theta)n^{1/2}| < \varepsilon n^{1/2}$ is the value
of the normal integral between limits $\pm\varepsilon n^{1/2}$. Given any positive $\eta$, we can always
take $n$ large enough for this quantity to be greater than $1-\eta$ and it will continue to
be so for any larger $n$. $N$ may therefore be determined and the inequality (17.7) is
satisfied.
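
The determination of $N$ in Example 17.1 can be sketched numerically. The following is a minimal illustration, not part of the original argument: the function names `coverage` and `smallest_n` are hypothetical, and the parent is assumed to be the unit-variance normal population (17.1), so that the probability in (17.7) is the normal integral between $\pm\varepsilon n^{1/2}$.

```python
import math


def coverage(n: int, eps: float) -> float:
    """P{|t - theta| < eps} for the mean t of n observations from (17.1).

    The normal integral between -eps*sqrt(n) and +eps*sqrt(n) equals
    erf(eps*sqrt(n)/sqrt(2)).
    """
    return math.erf(eps * math.sqrt(n) / math.sqrt(2.0))


def smallest_n(eps: float, eta: float) -> int:
    """Smallest sample size whose coverage exceeds 1 - eta.

    Because coverage() increases with n, the inequality (17.7) then
    holds for every larger n as well.
    """
    n = 1
    while coverage(n, eps) <= 1.0 - eta:
        n += 1
    return n


if __name__ == "__main__":
    # Illustrative values only: eps = 0.1, eta = 0.05.
    print(smallest_n(0.1, 0.05))
```

Any $N$ at least as large as the returned value satisfies the definition, since the coverage probability never decreases as $n$ grows.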


Example 17.2 
Suppose we have a statistic $t_n$ whose mean value differs from $\theta$ by terms of order
$n^{-1}$, whose variance $v_n$ is of order $n^{-1}$ and which tends to normality as $n$ increases.
Clearly, as in Example 17.1, $(t_n-\theta)/v_n^{1/2}$ will then tend to a standardized normal
variate, so that $t_n-\theta$ tends to zero in probability and $t_n$
will be consistent. This covers a great many statistics encountered in practice.
Even if the limiting distribution of $t_n$ is unspecified, the result will still hold, as
can be seen from a direct application of the Bienaymé-Tchebycheff inequality (3.94).
In fact, if

$$E(t_n) = \theta + k_n \quad\text{and}\quad \operatorname{var} t_n = v_n,$$

where

$$\lim_{n\to\infty} k_n = \lim_{n\to\infty} v_n = 0,$$

we have at once

$$P\{|t_n - (\theta + k_n)| < \varepsilon\} \ge 1 - \frac{v_n}{\varepsilon^2} \to 1 \quad (n \to \infty),$$

so that (17.7) will be satisfied.


Unbiassed estimators 


17.8 The property of consistency is a limiting property, that is to say, it concerns
the behaviour of an estimator as the sample number tends to infinity. It requires
nothing of the estimator’s behaviour for finite $n$, and if there exists one consistent
estimator $t_n$, we may construct infinitely many others: e.g. for fixed $a$ and $b$,

$$\frac{n-a}{n-b}\,t_n$$

is also consistent. We have seen that in some circumstances a consistent estimator
of the population mean is the sample mean

$$\bar{x} = \sum x_j / n.$$

But so is

$$\bar{x}' = \sum x_j / (n-1).$$

Why do we prefer one to the other? Intuitively it seems absurd to divide the sum
of $n$ quantities by anything other than their number $n$. We shall see in a moment,
however, that intuition is not a very reliable guide in such matters. There are reasons
for preferring

$$\frac{1}{n-1}\sum_{j=1}^{n}(x_j-\bar{x})^2 \quad\text{to}\quad \frac{1}{n}\sum_{j=1}^{n}(x_j-\bar{x})^2$$

as an estimator of the parent variance, notwithstanding the fact that the latter is the
sample variance.


17.9 Consider the sampling distribution of an estimator $t$. If the estimator is
consistent, its distribution must, for large samples, have a central value in the
neighbourhood of $\theta$. We may choose among the class of consistent estimators by requiring
that $\theta$ shall be equated to this central value not merely for large, but for all samples.

If we require that for all $n$ and $\theta$ the mean value of $t$ shall be $\theta$, we define what is
known as an unbiassed estimator by the relation

$$E(t) = \theta.$$

This is an unfortunate word, like so many in statistics. There is nothing except
convenience to exalt the arithmetic mean above other measures of location as a criterion
of bias. We might equally well have chosen the median of the distribution of $t$ or
its mode as determining the “unbiassed” estimator. The mean value is used, as
always, for its mathematical convenience. This is perfectly legitimate, and it is only
necessary to remark that the term “unbiassed” should not be allowed to convey
overtones of a non-technical nature.




Example 17.3 
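Since

$$E\left\{\frac{1}{n}\sum x\right\} = \frac{1}{n}\sum E(x) = \mu_1',$$

the mean-statistic is an unbiassed estimator of the parent mean whenever the latter
exists. But the sample variance is not an unbiassed estimator of the parent variance.

We have

$$E\left\{\sum_j (x_j-\bar{x})^2\right\} = E\left\{\sum_j \Big[x_j - \sum_k x_k/n\Big]^2\right\}$$
$$= E\left\{\frac{n-1}{n}\sum_j x_j^2 - \frac{1}{n}\sum\sum_{j\neq k} x_j x_k\right\}$$
$$= (n-1)\mu_2' - (n-1)\mu_1'^2$$
$$= (n-1)\mu_2.$$

Thus $\frac{1}{n}\sum (x-\bar{x})^2$ has a mean value $\frac{n-1}{n}\mu_2$. On the other hand, an unbiassed estimator
is given by

$$\frac{1}{n-1}\sum (x-\bar{x})^2,$$

and for this reason it is usually preferred to the sample variance.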
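
The bias factor $(n-1)/n$ found in Example 17.3 can be checked exactly on a small case. The sketch below is an illustration under stated assumptions: the three-point population `[0, 1, 3]` with equal probabilities and the function names are hypothetical choices, and sampling is taken to be independent (with replacement), so every ordered sample is equally likely.

```python
from fractions import Fraction
from itertools import product


def population_variance(values):
    """Exact mu_2 of a population giving equal probability to each value."""
    mean = Fraction(sum(values), len(values))
    return sum((v - mean) ** 2 for v in values) / len(values)


def expected_sample_variance(values, n):
    """Exact E{(1/n) * sum (x - xbar)^2} over all ordered samples of size n."""
    total = Fraction(0)
    count = 0
    for sample in product(values, repeat=n):
        xbar = Fraction(sum(sample), n)
        total += sum((x - xbar) ** 2 for x in sample) / n
        count += 1
    return total / count


if __name__ == "__main__":
    values, n = [0, 1, 3], 3  # hypothetical population and sample size
    mu2 = population_variance(values)
    biased = expected_sample_variance(values, n)
    # The two printed quantities agree, as the algebra above requires.
    print(biased, Fraction(n - 1, n) * mu2)
```

Multiplying the divisor-$n$ statistic by $n/(n-1)$ therefore removes the bias exactly in this enumeration.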


Our discussion shows that consistent estimators are not necessarily unbiassed. We
have already (Example 14.5) encountered an unbiassed estimator which is not
consistent. Thus neither property implies the other. But a consistent estimator whose
asymptotic distribution has finite mean value must be asymptotically unbiassed.

In certain circumstances, there may be no unbiassed estimator (cf. Exercise 17.12).
Even if there is one, it may occur that it necessarily gives absurd results at times, or even
always. For example, in estimating a parameter $\theta$, $0 \le \theta \le 1$, no statistic distributed
in the range (0, 1) will be unbiassed, for if $\theta = 0$ its expectation must (except in trivial
degenerate cases) exceed $\theta$. We shall meet an important example of this in 27.34
below. Exercise 17.26, due to E. L. Lehmann, gives an example where an unbiassed
estimator always gives absurd results.


Corrections for bias 

17.10 If we have a consistent but biassed estimator $t$, and wish to remove its bias,
this may be possible by direct evaluation of its expected value and the application of
a simple adjustment, as in Example 17.3. But sometimes the expected value is a rather
complicated function of the parameter, $\theta$, being estimated, and it is not obvious what
the correction should be. Quenouille (1956) has proposed an ingenious method of
overcoming this difficulty in a fairly general class of situations.

We denote our (biassed) estimator by $t_n$, its suffix being the number of
observations from which $t_n$ is calculated. Now suppose that $t_n$ is a function of the sample
k-statistics $k_j$ (Chapter 12) which are unbiassed estimators of the population cumulants
$\kappa_j$, all of which are assumed to exist. If we may expand $t_n$ in a Taylor series about $\theta$,
we have

$$t_n - \theta = \sum_j (k_j-\kappa_j)\left(\frac{\partial t_n}{\partial k_j}\right) + \frac{1}{2}\sum_j\sum_l (k_j-\kappa_j)(k_l-\kappa_l)\left(\frac{\partial^2 t_n}{\partial k_j\,\partial k_l}\right) + \ldots, \qquad (17.8)$$

the derivatives being taken at the true values $k_j = \kappa_j$. If we take expectations on both
sides of (17.8), it follows from the fact (12.13) that the moments of the k-statistics are
power series in $(1/n)$ that we shall have

$$E(t_n) - \theta = \sum_{r=1}^{\infty} a_r/n^r. \qquad (17.9)$$

Of course, (17.9) may hold even if $t_n$ is not a function of the k-statistics, as in Exercise
17.13. Now let $\bar{t}_{n-1}$ denote the statistic calculated for all the $n$ subsets of $(n-1)$
observations and averaged, and consider the new statistic

$$t_n' = n t_n - (n-1)\,\bar{t}_{n-1}. \qquad (17.10)$$

It follows at once from (17.9) that

$$E(t_n') - \theta = a_2\left(\frac{1}{n}-\frac{1}{n-1}\right) + a_3\left(\frac{1}{n^2}-\frac{1}{(n-1)^2}\right) + \ldots = -\frac{a_2}{n^2} - O(n^{-3}). \qquad (17.11)$$

Thus $t_n'$ is only biassed to order $1/n^2$. Similarly,

$$t_n'' = \{n^2 t_n' - (n-1)^2\,\bar{t}_{n-1}'\}/\{n^2-(n-1)^2\}$$

will be biassed only to order $1/n^3$, and so on. This method provides a very direct
means of removing bias to any required degree.

Exercise 17.18 is to show that $\operatorname{var} t_n' = \operatorname{var} t_n$ to order $1/n$, so that removal of
first-order bias does not involve larger variance; but this is not generally true for further
order corrections.

R. G. Miller (1964) discusses the asymptotic normality of $t_n'$ and the estimation of its
variance.
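
The correction (17.10) is simple to compute for any statistic. The following is a minimal sketch, assuming the estimator is supplied as a function of a list of observations; the names `bias_corrected` and `variance_divisor_n` are hypothetical. Applied to the divisor-$n$ sample variance of Example 17.3, the correction reproduces the divisor-$(n-1)$ form exactly, which the usage lines check with rational arithmetic.

```python
from fractions import Fraction


def bias_corrected(estimator, sample):
    """Quenouille's statistic t'_n = n*t_n - (n-1)*tbar_{n-1} of (17.10).

    tbar_{n-1} is the average of the estimator over the n subsets obtained
    by omitting one observation at a time.
    """
    n = len(sample)
    t_n = estimator(sample)
    leave_one_out = [estimator(sample[:i] + sample[i + 1:]) for i in range(n)]
    t_bar = sum(leave_one_out) / n
    return n * t_n - (n - 1) * t_bar


def variance_divisor_n(sample):
    """Sample variance with divisor n (biassed for the parent variance)."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / n


if __name__ == "__main__":
    # Arbitrary illustrative data, held as exact fractions.
    data = [Fraction(v) for v in (2, 3, 5, 7, 11)]
    n = len(data)
    mean = sum(data) / n
    unbiased = sum((x - mean) ** 2 for x in data) / (n - 1)
    print(bias_corrected(variance_divisor_n, data) == unbiased)
```

The cost is $n$ further evaluations of the estimator, one for each omitted observation.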


Example 17.4 
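To find an unbiassed estimator of $\theta^2$ in the binomial distribution

$$P\{x = r\} = \binom{n}{r}\theta^r(1-\theta)^{n-r}, \qquad r = 0, 1, 2, \ldots, n.$$

The intuitive estimator is

$$t_n = \left(\frac{r}{n}\right)^2,$$

since $r/n$ is an unbiassed estimator of $\theta$. Now $t_{n-1}$ can only take the values

$$\left(\frac{r-1}{n-1}\right)^2 \quad\text{or}\quad \left(\frac{r}{n-1}\right)^2$$

according to whether a “success” or a “failure” is omitted from the sample. Thus

$$\bar{t}_{n-1} = \frac{1}{n}\left\{r\left(\frac{r-1}{n-1}\right)^2 + (n-r)\left(\frac{r}{n-1}\right)^2\right\} = \frac{r^2(n-2)+r}{n(n-1)^2}.$$

Hence, from (17.10),

$$t_n' = n t_n - (n-1)\,\bar{t}_{n-1} = \frac{r^2}{n} - \frac{r^2(n-2)+r}{n(n-1)} = \frac{r(r-1)}{n(n-1)}, \qquad (17.12)$$

which, it may be directly verified, is exactly unbiassed for $\theta^2$.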
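
The direct verification mentioned at the end of Example 17.4 can be carried out exactly for particular values. This is a sketch under assumptions: the choices $n = 6$ and $\theta = 3/10$ in the usage lines are arbitrary, and the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def corrected_square(r, n):
    """t'_n of (17.10) applied to t_n = (r/n)^2 for r successes in n trials."""
    t_n = Fraction(r, n) ** 2
    # Omitting one of the r successes leaves r-1 successes in n-1 trials;
    # omitting one of the n-r failures leaves r successes in n-1 trials.
    t_bar = (r * Fraction(r - 1, n - 1) ** 2
             + (n - r) * Fraction(r, n - 1) ** 2) / n
    return n * t_n - (n - 1) * t_bar


def binomial_expectation(stat, n, theta):
    """Exact E{stat(r, n)} when r is binomial with index n and parameter theta."""
    return sum(comb(n, r) * theta ** r * (1 - theta) ** (n - r) * stat(r, n)
               for r in range(n + 1))


if __name__ == "__main__":
    n, theta = 6, Fraction(3, 10)  # arbitrary illustrative values
    print(binomial_expectation(corrected_square, n, theta) == theta ** 2)
```

The uncorrected statistic $(r/n)^2$ has expectation $\theta^2 + \theta(1-\theta)/n$, so its bias is of order $1/n$, in agreement with (17.9).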


17.11 In general there will exist more than one consistent estimator of a
parameter, even if we confine ourselves only to unbiassed estimators. Consider once again
the estimation of the mean of a normal population with variance $\sigma^2$. The sample
mean is consistent and unbiassed. We will now prove that the same is true of the
median.

Consideration of symmetry is enough to show that the median is an unbiassed
estimator of the population mean, which is, of course, the same as the population
median. For large $n$ the distribution of the median tends to the normal form (cf.
14.12)

$$dF(x) \propto \exp\{-2nf_1^2(x-\theta)^2\}\,dx, \qquad (17.13)$$

where $f_1$ is the population median ordinate, in our case equal to $(2\pi\sigma^2)^{-1/2}$. The
variance of the sample median is therefore, from (17.13), equal to $\pi\sigma^2/(2n)$ and tends to
zero for large $n$. Hence the estimator is consistent.


17.12 We must therefore seek further criteria to choose between estimators with
the common property of consistency. Such a criterion arises naturally if we consider
the sampling variances of the estimators. Generally speaking, the estimator with the
smaller variance will be distributed more closely round the value $\theta$; this will certainly
be so for distributions of the normal type. An unbiassed consistent estimator with
a smaller variance will therefore deviate less, on the average, from the true value than
one with a larger variance. Hence we may reasonably regard it as better.

In the case of the mean and median of normal samples we have, for any $n$, from
(17.3),

$$\operatorname{var}(\text{mean}) = \sigma^2/n, \qquad (17.14)$$

and for large $n$, from 17.11,

$$\operatorname{var}(\text{median}) = \pi\sigma^2/(2n), \qquad (17.15)$$

where $\sigma^2$ is the parent variance. Since $\pi/2 = 1.57 > 1$, the mean is more efficient
than the median for large $n$ at least. For small $n$ we have to work out the variance
of the median. The following values may be obtained from those given in Table XXIII
of Tables for Statisticians and Biometricians, Part II:

    n                        1      2      3      4      5
    var(median)/(σ²/n)     1.00   1.00   1.35   1.19   1.44

It appears that the mean always has smaller variance than the median in estimating
the mean of a normal distribution.
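
The large-sample ratio $\pi/2$ of (17.15) to (17.14) can be examined by simulation. The following is a rough sketch rather than an exact calculation: the sample size, number of replications and seed are arbitrary assumptions, the function name `variance_ratio` is hypothetical, and the result is subject to sampling error.

```python
import random
import statistics


def variance_ratio(n, replications, seed=0):
    """Monte Carlo estimate of var(median)/var(mean) for normal samples of size n."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(replications):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return statistics.pvariance(medians) / statistics.pvariance(means)


if __name__ == "__main__":
    # For moderately large n the ratio should lie near pi/2, about 1.57.
    print(variance_ratio(n=101, replications=4000))
```

An odd sample size is used so that the median is a single order statistic; for $n = 1$ the two estimators coincide and the ratio is unity.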
Example 17.5
For the Cauchy distribution

$$dF(x) = \frac{1}{\pi}\,\frac{dx}{\{1+(x-\theta)^2\}}, \qquad -\infty \le x \le \infty,$$

we have already seen (17.6) that the sample mean is not a consistent estimator of $\theta$,
the population median. However, for the sample median, $\tilde{x}$, we have, since the median
ordinate is $1/\pi$, the large-sample variance

$$\operatorname{var}\tilde{x} = \frac{\pi^2}{4n}$$

from (17.13). It is seen that the median is consistent, and although direct comparison
with the mean is not possible because the latter does not possess a sampling variance,
the median is evidently a better estimator of $\theta$ than the mean. This provides an
interesting contrast with the case of the normal distribution, particularly in view of the
similarity of the distributions.


Pitman (1937c) defined $t$ to be “closer” to $\theta$ than $u$ is if $P\{|t-\theta| < |u-\theta|\} > \tfrac{1}{2}$,
but this attractive concept is intransitive, as he pointed out. See also Geary (1944)
and Johnson (1950).


Minimum variance estimators 

17.13 It seems natural, then, to use the sampling variance of an estimator as 
a criterion of its acceptability, and it has, in fact, been so used since the days of Laplace 
and Gauss. But only in relatively recent times has it been established that, under 
fairly general conditions, there exists a bound below which the variance of an estimator 
cannot fall. In order to establish this bound, we first derive some preliminary results, 
which will also be useful in other connexions later. 


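17.14 If the frequency function of the continuous or discrete population is $f(x|\theta)$,
we define the Likelihood Function(*) of a sample of $n$ independent observations by

$$L(x_1, x_2, \ldots, x_n\,|\,\theta) = f(x_1|\theta)\,f(x_2|\theta)\cdots f(x_n|\theta). \qquad (17.16)$$

We shall often write this simply as $L$. Evidently, since $L$ is the joint frequency
function of the observations,

$$\int\!\cdots\!\int L\,dx_1\cdots dx_n = 1. \qquad (17.17)$$

Now suppose that the first two derivatives of $L$ with respect to $\theta$ exist for all $\theta$. If we
differentiate both sides of (17.17) with respect to $\theta$, and if we may interchange the
operations of differentiation and integration on its left-hand side, e.g. if the limits of
integration (i.e. the range of variation of $x$) are independent of $\theta$, we obtain

$$\int\!\cdots\!\int \frac{\partial L}{\partial\theta}\,dx_1\cdots dx_n = 0,$$

which we may rewrite

$$E\left(\frac{\partial\log L}{\partial\theta}\right) = \int\!\cdots\!\int\left(\frac{1}{L}\frac{\partial L}{\partial\theta}\right)L\,dx_1\cdots dx_n = 0. \qquad (17.18)$$

If we differentiate (17.18) again, we obtain, if we may again interchange operations,

$$\int\!\cdots\!\int\left\{\left(\frac{1}{L}\frac{\partial L}{\partial\theta}\right)\frac{\partial L}{\partial\theta} + L\,\frac{\partial}{\partial\theta}\left(\frac{1}{L}\frac{\partial L}{\partial\theta}\right)\right\}dx_1\cdots dx_n = 0,$$

which becomes

$$\int\!\cdots\!\int\left\{\left(\frac{1}{L}\frac{\partial L}{\partial\theta}\right)^2 + \frac{\partial^2\log L}{\partial\theta^2}\right\}L\,dx_1\cdots dx_n = 0$$

or

$$E\left\{\left(\frac{\partial\log L}{\partial\theta}\right)^2\right\} = -E\left(\frac{\partial^2\log L}{\partial\theta^2}\right). \qquad (17.19)$$

(*) R. A. Fisher calls $L$ the likelihood when regarded as a function of $\theta$ and the probability of
the sample when it is regarded as a function of $x$ for fixed $\theta$. While appreciating this distinction,
we use the term likelihood and the symbol $L$ in both cases to preserve a single notation.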
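
The identities (17.18) and (17.19) can be checked numerically in a simple discrete case, where the integrals become sums. The sketch below assumes a single observation from the Poisson distribution with parameter $\theta$; the function names and the truncation point of the sum are hypothetical choices made for illustration.

```python
import math


def poisson_pmf(x, theta):
    """f(x | theta) = exp(-theta) * theta^x / x!, computed on the log scale."""
    return math.exp(-theta + x * math.log(theta) - math.lgamma(x + 1))


def score_moments(theta, cutoff=200):
    """Return E(score), E(score^2) and -E(second derivative of log f).

    For one Poisson observation, d log f / d theta = x/theta - 1 and
    d^2 log f / d theta^2 = -x/theta^2.  The sum over x is truncated at
    `cutoff`, beyond which the probabilities are negligible for moderate theta.
    """
    e_score = e_score_sq = e_neg_second = 0.0
    for x in range(cutoff):
        p = poisson_pmf(x, theta)
        score = x / theta - 1.0
        e_score += p * score
        e_score_sq += p * score ** 2
        e_neg_second += p * x / theta ** 2
    return e_score, e_score_sq, e_neg_second


if __name__ == "__main__":
    # First value is close to 0; the other two are close to 1/theta.
    print(score_moments(2.5))
```

The range of $x$ here does not depend on $\theta$, which is the kind of case in which the interchange of differentiation and summation is permissible.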


17.15 Now consider an unbiassed estimator, $t$, of some function of $\theta$, say $\tau(\theta)$.
This formulation allows us to consider unbiassed and biassed estimators of $\theta$ itself,
and also permits us to consider, for example, the estimation of the standard deviation
when the parameter is equal to the variance. We thus have

$$E(t) = \int\!\cdots\!\int t\,L\,dx_1\cdots dx_n = \tau(\theta). \qquad (17.20)$$

We now differentiate (17.20), the result being

$$\int\!\cdots\!\int t\,\frac{\partial\log L}{\partial\theta}\,L\,dx_1\cdots dx_n = \tau'(\theta),$$

which we may re-write, using (17.18), as

$$\tau'(\theta) = \int\!\cdots\!\int \{t-\tau(\theta)\}\,\frac{\partial\log L}{\partial\theta}\,L\,dx_1\cdots dx_n. \qquad (17.21)$$

By the Cauchy-Schwarz inequality, we have from (17.21)

$$\{\tau'(\theta)\}^2 \le \int\!\cdots\!\int \{t-\tau(\theta)\}^2 L\,dx_1\cdots dx_n \cdot \int\!\cdots\!\int \left(\frac{\partial\log L}{\partial\theta}\right)^2 L\,dx_1\cdots dx_n,$$

which, on rearrangement, becomes

$$\operatorname{var} t = E\{t-\tau(\theta)\}^2 \ge \{\tau'(\theta)\}^2 \Big/ E\left\{\left(\frac{\partial\log L}{\partial\theta}\right)^2\right\}. \qquad (17.22)$$

This is the fundamental inequality for the variance of an estimator, often known as
the Cramér-Rao inequality, after two of its several discoverers (C. R. Rao (1945); Cramér
(1946)); it was apparently first given by Aitken and Silverstone (1942). Using (17.19),
it may be written in what is often, in practice, the more convenient form

$$\operatorname{var} t \ge -\{\tau'(\theta)\}^2 \Big/ E\left(\frac{\partial^2\log L}{\partial\theta^2}\right). \qquad (17.23)$$

We shall call (17.22) and (17.23) the minimum variance bound (abbreviated to MVB)
for the estimation of $\tau(\theta)$. An estimator which attains this bound for all $\theta$ will be
called a MVB estimator.
It is only necessary that (17.18) hold for the MVB (17.22) to follow from (17.20).
If (17.19) also holds, we may also write the MVB in the form (17.23). See Exercises
17.21 and 17.22.


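17.16 In the case where $t$ is estimating $\theta$ itself, we have $\tau'(\theta) = 1$ in (17.22) and
for an unbiassed estimator of $\theta$

$$\operatorname{var} t \ge 1 \Big/ E\left\{\left(\frac{\partial\log L}{\partial\theta}\right)^2\right\} = -1 \Big/ E\left(\frac{\partial^2\log L}{\partial\theta^2}\right). \qquad (17.24)$$

In this case the quantity $I$ defined as

$$I = E\left\{\left(\frac{\partial\log L}{\partial\theta}\right)^2\right\} \qquad (17.25)$$

is sometimes called the amount of information in the sample, although this is not a
universal usage.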


17.17 It is very easy to establish the condition under which the MVB is attained.
The inequality in (17.22) arose purely from the use of the Cauchy-Schwarz inequality,
and the necessary and sufficient condition that the Cauchy-Schwarz inequality becomes
an equality is (cf. 2.7) that $\{t-\tau(\theta)\}$ is proportional to $\partial\log L/\partial\theta$ for all sets of
observations. We may write this condition

$$\frac{\partial\log L}{\partial\theta} = A\cdot\{t-\tau(\theta)\}, \qquad (17.26)$$

where $A$ is independent of the observations but may be a function of $\theta$. Thus (17.26)
becomes

$$\frac{\partial\log L}{\partial\theta} = A(\theta)\,\{t-\tau(\theta)\}. \qquad (17.27)$$

Further, from (17.27) and (17.18),

$$\operatorname{var}\left(\frac{\partial\log L}{\partial\theta}\right) = E\left\{\left(\frac{\partial\log L}{\partial\theta}\right)^2\right\} = \{A(\theta)\}^2 \operatorname{var} t, \qquad (17.28)$$

and since in this case (17.22) is an equality, (17.28) substituted into it gives

$$\operatorname{var} t = |\tau'(\theta)/A(\theta)|. \qquad (17.29)$$

We thus conclude that if (17.27) is satisfied, $t$ is a MVB estimator of $\tau(\theta)$, with variance
(17.29), which is then equal to the right-hand side of (17.23). If $\tau(\theta) = \theta$, $\operatorname{var} t$ is
just $1/A(\theta)$, which is then equal to the right-hand side of (17.24).


Example 17.6 
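To estimate $\theta$ in the normal population

$$dF(x) = \frac{1}{\sigma(2\pi)^{1/2}} \exp\left\{-\frac{1}{2}\left(\frac{x-\theta}{\sigma}\right)^2\right\}dx, \qquad -\infty < x < \infty,$$

where $\sigma$ is known.

We have

$$\frac{\partial\log L}{\partial\theta} = \frac{n}{\sigma^2}(\bar{x}-\theta).$$

This is of the form (17.27) with

$$t = \bar{x}, \quad A(\theta) = n/\sigma^2 \quad\text{and}\quad \tau(\theta) = \theta.$$

Thus $\bar{x}$ is the MVB estimator of $\theta$, with variance $\sigma^2/n$.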


Example 17.7 
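To estimate $\theta$ in

$$dF(x) = \frac{1}{\pi}\,\frac{dx}{\{1+(x-\theta)^2\}}, \qquad -\infty \le x \le \infty.$$

We have

$$\frac{\partial\log L}{\partial\theta} = 2\sum \frac{x-\theta}{\{1+(x-\theta)^2\}}.$$

This cannot be put in the form (17.27). Thus there is no MVB estimator in this case.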


Example 17.8 
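To estimate $\theta$ in the Poisson distribution

$$f(x|\theta) = e^{-\theta}\theta^x/x!, \qquad x = 0, 1, 2, \ldots$$

We have

$$\frac{\partial\log L}{\partial\theta} = \frac{n}{\theta}(\bar{x}-\theta).$$

Thus $\bar{x}$ is the MVB estimator of $\theta$, with variance $\theta/n$.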


Example 17.9 


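To estimate $\theta$ in the binomial distribution, for which

$$L(r|\theta) = \binom{n}{r}\theta^r(1-\theta)^{n-r}, \qquad r = 0, 1, 2, \ldots, n.$$

We find

$$\frac{\partial\log L}{\partial\theta} = \frac{n}{\theta(1-\theta)}\left(\frac{r}{n}-\theta\right).$$

Hence $r/n$ is the MVB estimator of $\theta$, with variance $\theta(1-\theta)/n$.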
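
The attainment condition (17.27) and the variance (17.29) can be confirmed exactly for the binomial case of Example 17.9. This is a minimal sketch: the function names are hypothetical and the values of $n$ and $\theta$ in the usage lines are arbitrary.

```python
from fractions import Fraction
from math import comb


def score(r, n, theta):
    """d log L / d theta for the binomial likelihood with r successes."""
    return Fraction(r) / theta - Fraction(n - r) / (1 - theta)


def mvb_check(n, theta):
    """Return (is_linear, var(r/n), 1/A(theta), 1/E(score^2)), all exact."""
    a = n / (theta * (1 - theta))  # A(theta) in the form (17.27)
    is_linear = all(score(r, n, theta) == a * (Fraction(r, n) - theta)
                    for r in range(n + 1))
    pmf = [comb(n, r) * theta ** r * (1 - theta) ** (n - r) for r in range(n + 1)]
    mean = sum(p * Fraction(r, n) for r, p in enumerate(pmf))
    var = sum(p * (Fraction(r, n) - mean) ** 2 for r, p in enumerate(pmf))
    info = sum(p * score(r, n, theta) ** 2 for r, p in enumerate(pmf))
    return is_linear, var, 1 / a, 1 / info


if __name__ == "__main__":
    print(mvb_check(8, Fraction(1, 4)))  # arbitrary illustrative values
```

The last three returned values coincide, so that the variance of $r/n$ equals the bound (17.24) in this instance.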


17.18 It follows from our discussion that, where a MVB estimator exists, it will
exist for one specific function $\tau(\theta)$ of the parameter $\theta$, and for no other function of $\theta$.
The following example makes the point clear.


Example 17.10 
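To estimate $\theta$ in the normal distribution

$$dF(x) = \frac{1}{\theta(2\pi)^{1/2}} \exp\left(-\frac{x^2}{2\theta^2}\right)dx, \qquad -\infty < x < \infty.$$

We find

$$\frac{\partial\log L}{\partial\theta} = -\frac{n}{\theta} + \frac{\sum x^2}{\theta^3} = \frac{n}{\theta^3}\left(\frac{1}{n}\sum x^2 - \theta^2\right).$$

We see at once that $\frac{1}{n}\sum x^2$ is a MVB estimator of $\theta^2$ (the variance of the population) with
sampling variance $\frac{\theta^3}{n}\,\frac{d}{d\theta}(\theta^2) = \frac{2\theta^4}{n}$ by (17.29). But there is no MVB estimator of
$\theta$ itself.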


Equation (17.27) determines a condition on the frequency function under which
a MVB estimator of some function of $\theta$, $\tau(\theta)$, exists. If the frequency is not of this
form, there may still be an estimator of $\tau(\theta)$ which has, uniformly in $\theta$, smaller
variance than any other estimator; we then call it a minimum variance (MV)
estimator. In other words, the least attainable variance may be greater than the MVB.
Further, if the regularity conditions leading to the MVB do not hold, the least
attainable variance may be less than the (in this case inapplicable) MVB. In any case,
(17.27) demonstrates that there can only be one function of $\theta$ for which the MVB is
attainable, namely, that function (if any) which is the expectation of a statistic $t$ in
terms of which $\partial\log L/\partial\theta$ may be linearly expressed.


17.19 From (17.27) we have on integration the necessary form for the Likelihood
Function (continuing to write $A(\theta)$ for the integral of the arbitrary function $A(\theta)$ in
(17.27))

$$\log L = tA(\theta) + P(\theta) + R(x_1, x_2, \ldots, x_n),$$

which we may re-write in the frequency-function form

$$f(x|\theta) = \exp\{A(\theta)B(x) + C(x) + D(\theta)\}, \qquad (17.30)$$

where $t = \sum_{i=1}^{n} B(x_i)$, $R(x_1, \ldots, x_n) = \sum_{i=1}^{n} C(x_i)$ and $P(\theta) = nD(\theta)$. (17.30) is often
called the exponential family of distributions. We shall return to it in 17.36.

17.20 We can find better (i.e. greater) lower bounds than the MVB (17.22) for the
variance of an estimator in cases where the MVB is not attainable. The essential
condition for (17.22) to be an attainable bound is that there be an estimator $t$ for which
$t-\tau(\theta)$ is a linear function of $\frac{\partial\log L}{\partial\theta} = \frac{1}{L}\frac{\partial L}{\partial\theta}$. But even if no such estimator exists,
there may still be one for which $t-\tau(\theta)$ is a linear function of $\frac{1}{L}\frac{\partial L}{\partial\theta}$ and $\frac{1}{L}\frac{\partial^2 L}{\partial\theta^2}$ or,
in general, of the higher derivatives of the Likelihood Function. It is this fact which
leads to the following considerations, due to Bhattacharyya (1946).
Estimating $\tau(\theta)$ by a statistic $t$ as before, we write

$$L^{(r)} = \frac{\partial^r L}{\partial\theta^r} \quad\text{and}\quad \tau^{(r)} = \frac{\partial^r \tau(\theta)}{\partial\theta^r}.$$
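We now construct the function

$$D_s = t - \tau(\theta) - \sum_{r=1}^{s} a_r L^{(r)}/L, \qquad (17.31)$$

where the $a_r$ are constants to be determined. Since, on differentiating (17.17) $r$ times,
we obtain

$$E(L^{(r)}/L) = 0 \qquad (17.32)$$

under the same conditions as before, we have from (17.31) and (17.32)

$$E(D_s) = 0. \qquad (17.33)$$

The variance of $D_s$ is therefore

$$E(D_s^2) = \int\!\cdots\!\int \Big\{(t-\tau(\theta)) - \sum_r a_r L^{(r)}/L\Big\}^2 L\,dx_1\cdots dx_n. \qquad (17.34)$$

We minimize (17.34) for variation of the $a_r$ by putting

$$\int\!\cdots\!\int \Big\{(t-\tau(\theta)) - \sum_r a_r L^{(r)}/L\Big\}\frac{L^{(p)}}{L}\,L\,dx_1\cdots dx_n = 0 \qquad (17.35)$$

for $p = 1, 2, \ldots, s$. This gives

$$\int\!\cdots\!\int (t-\tau(\theta))\,L^{(p)}\,dx_1\cdots dx_n = \sum_r a_r \int\!\cdots\!\int \frac{L^{(r)}}{L}\cdot\frac{L^{(p)}}{L}\,L\,dx_1\cdots dx_n. \qquad (17.36)$$

The left-hand side of (17.36) is, from (17.32), equal to

$$\int\!\cdots\!\int t\,L^{(p)}\,dx_1\cdots dx_n = \tau^{(p)}$$

on comparison with (17.20). The right-hand side of (17.36) is simply

$$\sum_r a_r\,E\left(\frac{L^{(r)}}{L}\cdot\frac{L^{(p)}}{L}\right).$$

On insertion of these values in (17.36) it becomes

$$\tau^{(p)} = \sum_{r=1}^{s} a_r\,E\left(\frac{L^{(r)}}{L}\cdot\frac{L^{(p)}}{L}\right), \qquad p = 1, 2, \ldots, s. \qquad (17.37)$$

We may invert this set of linear equations, provided that the matrix of coefficients
$J_{rp} = E\left(\frac{L^{(r)}}{L}\cdot\frac{L^{(p)}}{L}\right)$ is non-singular, to obtain

$$a_r = \sum_{p=1}^{s} \tau^{(p)} J^{-1}_{rp}, \qquad r = 1, 2, \ldots, s, \qquad (17.38)$$

where $J^{-1}_{rp}$ denotes an element of the reciprocal of the matrix $(J_{rp})$.
Thus, at the minimum of (17.34), (17.31) takes the value

$$D_s = t - \tau(\theta) - \sum_r\sum_p \tau^{(p)} J^{-1}_{rp}\,L^{(r)}/L, \qquad (17.39)$$

and (17.34) itself has the value, from (17.39),

$$E(D_s^2) = \int\!\cdots\!\int \Big\{(t-\tau(\theta)) - \sum_r\sum_p \tau^{(p)} J^{-1}_{rp}\,L^{(r)}/L\Big\}^2 L\,dx_1\cdots dx_n,$$

which, on using (17.35), becomes

$$= \int\!\cdots\!\int \{t-\tau(\theta)\}^2 L\,dx_1\cdots dx_n - \sum_r\sum_p \tau^{(p)} J^{-1}_{rp} \int\!\cdots\!\int \{t-\tau(\theta)\}\,L^{(r)}\,dx_1\cdots dx_n,$$

and finally we have

$$E(D_s^2) = \operatorname{var} t - \sum_r\sum_p \tau^{(p)} J^{-1}_{rp}\,\tau^{(r)}. \qquad (17.40)$$

Since its left-hand side is non-negative, (17.40) gives the required inequality

$$\operatorname{var} t \ge \sum_{r=1}^{s}\sum_{p=1}^{s} \tau^{(p)} J^{-1}_{rp}\,\tau^{(r)}. \qquad (17.41)$$

In the particular case $s = 1$, (17.41) reduces to the MVB (17.22).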


17.21 The condition for the bound in (17.41) to be attained is simply that
$E(D_s^2) = 0$, which with (17.33) leads to $D_s = 0$ or, from (17.39),

$$t - \tau(\theta) = \sum_{r=1}^{s}\sum_{p=1}^{s} \tau^{(p)} J^{-1}_{rp}\,L^{(r)}/L, \qquad (17.42)$$

which is the generalization of (17.27). (17.42) requires $t-\tau(\theta)$ to be a linear function
of the quantities $L^{(r)}/L$. If it is a linear function of the first $s$ such quantities, there
is clearly nothing to be gained by adding further terms. On the other hand, the
right-hand side of (17.41) is a non-decreasing function of $s$, as is easily seen by the
consideration that the variance of $D_s$ cannot be increased by allowing the optimum choice
of a further coefficient $a_r$ in (17.34). Thus we may expect, in some cases where the
MVB (17.22) is not attained for the estimation of a particular function $\tau(\theta)$, to find
the greater bound in (17.41) attained for some value of $s > 1$.


17.22 We now investigate the improvement in the bound arising from taking
$s = 2$ instead of $s = 1$ in (17.41). Remembering that we have defined

$$J_{rp} = E\Bigl(\frac{L^{(r)}}{L}\cdot\frac{L^{(p)}}{L}\Bigr), \tag{17.43}$$

we find that we may rewrite (17.41) in this case as

$$\begin{vmatrix} \operatorname{var} t & \tau' & \tau'' \\ \tau' & J_{11} & J_{12} \\ \tau'' & J_{12} & J_{22} \end{vmatrix} \ge 0, \tag{17.44}$$

which becomes, on expansion,

$$\operatorname{var} t \ge \frac{(\tau')^2}{J_{11}} + \frac{(\tau' J_{12} - \tau'' J_{11})^2}{J_{11}(J_{11}J_{22} - J_{12}^2)}. \tag{17.45}$$

The second item on the right of (17.45) is the improvement in the bound to the
variance of $t$. It may easily be confirmed that, writing $J_{rp}(n)$ as a function of sample size,

$$\begin{aligned} J_{11}(n) &= nJ_{11}(1), \\ J_{12}(n) &= nJ_{12}(1), \\ J_{22}(n) &= nJ_{22}(1) + 2n(n-1)\{J_{11}(1)\}^2, \end{aligned} \tag{17.46}$$

and using (17.46), (17.45) becomes

$$\operatorname{var} t \ge \frac{(\tau')^2}{nJ_{11}(1)} + \frac{\{\tau'' - \tau' J_{12}(1)/J_{11}(1)\}^2}{2n^2\{J_{11}(1)\}^2} + o\Bigl(\frac{1}{n^2}\Bigr), \tag{17.47}$$

which makes it clear that the improvement in the bound is of order $1/n^2$ compared
with the leading term of order $1/n$, and is only of importance when $\tau' = 0$ and the
first term disappears. In the case where $t$ is estimating $\theta$ itself, (17.47) becomes

$$\operatorname{var} t \ge \frac{1}{nJ_{11}(1)} + \frac{\{J_{12}(1)\}^2}{2n^2\{J_{11}(1)\}^4} + o\Bigl(\frac{1}{n^2}\Bigr),$$

which we may rewrite, using (17.46) to return to our original notation,

$$\operatorname{var} t \ge \frac{1}{J_{11}} + \frac{J_{12}^2}{2J_{11}^4} + o\Bigl(\frac{1}{n^2}\Bigr). \tag{17.48}$$

If the bound with $s = 2$ equals the MVB ($s = 1$), this does not generally imply that
the latter is attainable. For the exponential family (17.30), however, this equality does
imply attainability (cf. Patil and Shorrock (1965)).


Example 17.11 


To estimate $\theta(1-\theta)$ in the binomial distribution of Example 17.9, it is natural to
take as our estimator an unbiassed function of $r/n = p$, which is the MVB estimator
of $\theta$. We have seen in Example 17.4 that $r(r-1)/\{n(n-1)\}$ is an unbiassed estimator
of $\theta^2$. Hence

$$t = \frac{r}{n} - \frac{r(r-1)}{n(n-1)} = \Bigl(\frac{n}{n-1}\Bigr)p(1-p)$$

is an unbiassed estimator of $\theta(1-\theta)$, and its large-sample variance is given, to the first
order of approximation (cf. (10.14)), by

$$\operatorname{var} t \sim \Bigl(\frac{n}{n-1}\Bigr)^2\Bigl[\frac{d}{dp}\{p(1-p)\}\Bigr]^2_{p=\theta}\operatorname{var} p \sim \theta(1-\theta)(1-2\theta)^2/n.$$

Since $\tau' = (1-2\theta)$, and $J_{11} = n/\{\theta(1-\theta)\}$, this is the first term of (17.41) and (17.45),
so that $t$ is asymptotically a MVB estimator if this leading term is non-zero. But
when $\theta = \tfrac12$, it is zero, so we must consider the second term in the series to obtain a
non-zero bound to the variance. It is easily verified that

$$J_{12} = -\frac{2n(1-2\theta)}{\theta^2(1-\theta)^2}$$

and that the second term in (17.47) is

$$\frac{2}{n^2}\{(1-2\theta)^2 - \theta(1-\theta)\}^2.$$
When $\theta = \tfrac12$, this is equal to $1/(8n^2)$. By writing

$$t = \Bigl(\frac{n}{n-1}\Bigr)\bigl\{\tfrac14 - (p-\tfrac12)^2\bigr\}$$

the exact variance of $t$ when $\theta = \tfrac12$ is obtainable from the moments of the binomial
distribution as $\operatorname{var} t = 1/\{8n(n-1)\}$, agreeing with the second term in (17.47) to order $1/n^2$.
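
The exact variance quoted in Example 17.11 can be checked by direct enumeration. The following is a minimal sketch, not part of the original text: the function name is hypothetical, and the expressions for $L'/L$ and $L''/L$ are written out by hand for the binomial likelihood. It computes var $t$ exactly and, for comparison, the $s = 2$ bound of (17.41) from the exact values of $J_{11}$, $J_{12}$ and $J_{22}$.

```python
from fractions import Fraction
from math import comb


def binomial_variance_and_bound(n, theta):
    """Exact var t and the s = 2 bound (17.41) for t = r(n-r)/{n(n-1)}.

    Hypothetical helper: n >= 2 trials, theta a Fraction strictly between 0 and 1.
    """
    theta = Fraction(theta)
    tau = theta * (1 - theta)
    d1, d2 = 1 - 2 * theta, Fraction(-2)      # tau' and tau''
    var_t = j11 = j12 = j22 = Fraction(0)
    for r in range(n + 1):
        prob = comb(n, r) * theta**r * (1 - theta)**(n - r)
        t = Fraction(r * (n - r), n * (n - 1))
        u1 = r / theta - (n - r) / (1 - theta)                  # L'/L
        u2 = u1**2 - r / theta**2 - (n - r) / (1 - theta)**2    # L''/L
        var_t += prob * (t - tau)**2
        j11 += prob * u1 * u1
        j12 += prob * u1 * u2
        j22 += prob * u2 * u2
    det = j11 * j22 - j12**2
    # Quadratic form (tau', tau'') J^{-1} (tau', tau'')^T for the 2 x 2 matrix J.
    bound = (d1**2 * j22 - 2 * d1 * d2 * j12 + d2**2 * j11) / det
    return var_t, bound


var_t, bound = binomial_variance_and_bound(10, Fraction(1, 2))
print(var_t, bound, Fraction(1, 8 * 10 * 9))
```

With $n = 10$ and $\theta = \tfrac12$ the variance comes out as $1/720 = 1/\{8n(n-1)\}$. Exact rational arithmetic is used so that the comparison with the bound is not blurred by rounding.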


17.23 Example 17.11 brings out one point of some importance. If we have a
MVB estimator $t$ for $\tau(\theta)$, i.e.

$$\operatorname{var} t = \{\tau'(\theta)\}^2/J_{11} \tag{17.49}$$

and we require to estimate some other function of $\theta$, say $\psi\{\tau(\theta)\}$, we know from our
earlier work that in large samples

$$\operatorname{var}\{\psi(t)\} \sim \Bigl(\frac{\partial\psi}{\partial\tau}\Bigr)^2 \operatorname{var} t \sim \Bigl(\frac{\partial\psi}{\partial\tau}\Bigr)^2\Bigl(\frac{\partial\tau}{\partial\theta}\Bigr)^2\Big/J_{11}$$

from (17.49), provided that $\psi'$ is non-zero. But this may be rewritten

$$\operatorname{var}\{\psi(t)\} \sim \Bigl(\frac{\partial\psi}{\partial\theta}\Bigr)^2\Big/J_{11},$$

so that any function of $\theta$ has a MVB estimator in large samples if some function of $\theta$
has such an estimator for all $n$. Further, the estimator is always the corresponding
function of $t$. We shall, in 17.35, be able to supplement this result by a more exact
one concerning functions of the MVB estimator.


17.24 There is no difficulty in extending the results of this chapter on MVB
estimation to the case when the distribution has more than one parameter, and we are
interested in estimating a single function of them all, say $\tau(\theta_1, \theta_2, \ldots, \theta_k)$. In this
case, the analogue of the simplest result, (17.22), is

$$\operatorname{var} t \ge \sum_{i=1}^{k}\sum_{j=1}^{k} \frac{\partial\tau}{\partial\theta_i}\,\frac{\partial\tau}{\partial\theta_j}\,I^{-1}_{ij}, \tag{17.50}$$

where the matrix which has to be inverted to obtain the terms in (17.50) is

$$(I_{ij}) = \Bigl\{E\Bigl(\frac{1}{L}\frac{\partial L}{\partial\theta_i}\cdot\frac{1}{L}\frac{\partial L}{\partial\theta_j}\Bigr)\Bigr\}.$$

As before, (17.50) only takes account of terms of order $1/n$, and a more complicated
inequality is required if we are to take account of lower-order terms.


17.25 Even better lower bounds for the sampling variance of estimators than those 
we have discussed can be obtained, and remarkably enough this can be done without 
imposing regularity conditions. The following treatment of the case of unbiassed estim- 
ators is based on that of Kiefer (1952). 

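Writing $L(x\,|\,\theta)$ for the Likelihood Function as at (17.16), let $h$ be a random variable
independent of $x$ so defined that $(\theta+h)$ ranges over all admissible values of $\theta$, and let
$\lambda_1(h)$, $\lambda_2(h)$ be two completely arbitrary distribution functions. Then if an estimator $t$
is unbiassed for $\theta$, we have

$$\int \{t-(\theta+h)\}\,L(x\,|\,\theta+h)\,dx = 0,$$

where we have written a single integral and "$dx$" to denote the $n$-fold integration
required. We now integrate again, with respect to $\lambda_i(h)$, to obtain, for $i = 1, 2$,

$$\int\Bigl[\int \{t-(\theta+h)\}\,L(x\,|\,\theta+h)\,dx\Bigr]\,d\lambda_i(h) = 0.$$

Thus, with $E_i(h)$ denoting the mean of $h$ under $\lambda_i$, we may write

$$\begin{aligned} E_1(h) - E_2(h) &= \int (t-\theta)\Bigl[\int L(x\,|\,\theta+h)\,d\{\lambda_1(h)-\lambda_2(h)\}\Bigr]dx \\ &= \int (t-\theta)\{L(x\,|\,\theta)\}^{1/2}\cdot\frac{\int L(x\,|\,\theta+h)\,d(\lambda_1-\lambda_2)}{\{L(x\,|\,\theta)\}^{1/2}}\,dx. \end{aligned}$$

Using Schwarz's inequality, we have at once

$$\{E_1(h)-E_2(h)\}^2 \le \int (t-\theta)^2 L(x\,|\,\theta)\,dx \cdot \int \frac{\bigl[\int L(x\,|\,\theta+h)\,d(\lambda_1-\lambda_2)\bigr]^2}{L(x\,|\,\theta)}\,dx,$$

which yields

$$\operatorname{var} t \ge \frac{\{E_1(h)-E_2(h)\}^2}{\displaystyle\int \Bigl(\bigl[\textstyle\int L(x\,|\,\theta+h)\,d(\lambda_1-\lambda_2)\bigr]^2\big/L(x\,|\,\theta)\Bigr)dx}.$$

This is true for every $\lambda_1$ and $\lambda_2$. Thus

$$\operatorname{var} t \ge \sup \frac{\{E_1(h)-E_2(h)\}^2}{\displaystyle\int \Bigl(\bigl[\textstyle\int L(x\,|\,\theta+h)\,d(\lambda_1-\lambda_2)\bigr]^2\big/L(x\,|\,\theta)\Bigr)dx}, \tag{17.51}$$

the supremum being taken for all $\lambda_1 \ne \lambda_2$ for which the integrand with respect to $x$ is
defined. Barankin (1949) has shown that inequalities of this type give the best possible
bounds for var $t$.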

Now suppose that $\lambda_2(h)$ degenerates to a point at $h = 0$. (17.51) becomes

$$\operatorname{var} t \ge \sup_{\lambda_1} \frac{\{E_1(h)\}^2}{\displaystyle\int \Bigl(\bigl[\textstyle\int L(x\,|\,\theta+h)\,d\lambda_1\bigr]^2\big/L(x\,|\,\theta)\Bigr)dx - 1}. \tag{17.52}$$

If we further allow $\lambda_1$ to degenerate to a point at $h \ne 0$, (17.52) becomes

$$\operatorname{var} t \ge \frac{1}{\displaystyle\inf_h \frac{1}{h^2}\Bigl[\int \frac{\{L(x\,|\,\theta+h)\}^2}{L(x\,|\,\theta)}\,dx - 1\Bigr]}, \tag{17.53}$$

the infimum being for all $h \ne 0$ for which $L(x\,|\,\theta) = 0$ implies $L(x\,|\,\theta+h) = 0$. (17.53),
which was established by Chapman and Robbins (1951), is in general at least as good a
bound as (17.24), though more generally valid. For the denominator of the right-hand
side of (17.24) is

$$E\Bigl[\Bigl(\frac{\partial\log L}{\partial\theta}\Bigr)^2\Bigr] = \int \lim_{h\to 0}\Bigl[\frac{L(x\,|\,\theta+h)-L(x\,|\,\theta)}{h\,L(x\,|\,\theta)}\Bigr]^2 L(x\,|\,\theta)\,dx,$$

and provided that we may interchange the integral and the limiting operation, this becomes

$$= \lim_{h\to 0}\frac{1}{h^2}\Bigl[\int \frac{\{L(x\,|\,\theta+h)\}^2}{L(x\,|\,\theta)}\,dx - 1\Bigr].$$

The denominator on the right of (17.53) is the infimum of this quantity over all permissible
values of $h$, and is consequently no greater than the limit as $h$ tends to zero. Thus (17.53)
is at least as good a bound as (17.24). It follows a fortiori that (17.52) and (17.51) are even
better bounds in general, but they are subject to the difficulty of locating the appropriate
forms of $\lambda_1$ and $\lambda_2$ in each particular case: (17.53) is applicable directly. We shall find
that in cases where the range of a distribution is a function of a parameter, and (17.24)
does not apply, it will be possible to establish the properties of "best" estimators more
expeditiously by consideration of sufficient statistics, which we shall be discussing later
in this chapter.
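
As an illustration of (17.53) in a case where the range depends on the parameter, consider $n$ observations from the uniform distribution on $(0, \theta)$. This example is an assumption of the sketch, not taken from the text, and the function name is hypothetical. Only $-\theta < h < 0$ is permissible, and for such $h$ the integral of $\{L(x\,|\,\theta+h)\}^2/L(x\,|\,\theta)$ works out to $\{\theta/(\theta+h)\}^n$. The bound is then approximated by a simple grid search over $h$.

```python
def chapman_robbins_uniform(theta, n, grid=10_000):
    """Grid approximation to the bound (17.53) for n observations from U(0, theta).

    Illustrative assumption: for -theta < h < 0 the integral of
    L(x|theta+h)^2 / L(x|theta) over the sample space is {theta/(theta+h)}^n.
    """
    best = 0.0
    for i in range(1, grid):
        h = -theta * i / grid                      # permissible values of h
        integral = (theta / (theta + h)) ** n
        best = max(best, h * h / (integral - 1.0))
    return best


print(chapman_robbins_uniform(1.0, 1))
```

For a single observation the bound reduces to the maximum of $-h(\theta+h)$, namely $\theta^2/4$, attained at $h = -\theta/2$. The unbiassed estimator $2x$ has variance $\theta^2/3$, which respects it.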


17.26 So far, we have been largely concerned with the uniform attainment of the 
MVB by a single estimator. In many cases, however, it will not be so attained, even 
when the conditions under which it is derived are satisfied. When this is so, there may
still be a MV estimator for all values of $\theta$. The question of the uniqueness of a MV
estimator now arises. We can easily show that if a MV estimator exists, it is always
unique, irrespective of whether any bound is attained.
Let $t_1$ and $t_2$ be MV unbiassed estimators of $\tau(\theta)$, each with variance $V$. Consider
the new estimator

$$t_3 = \tfrac12(t_1+t_2)$$

which evidently also estimates $\tau(\theta)$ with variance

$$\operatorname{var} t_3 = \tfrac14\{\operatorname{var} t_1 + \operatorname{var} t_2 + 2\operatorname{cov}(t_1, t_2)\}. \tag{17.54}$$

Now by Schwarz's inequality

$$\begin{aligned} \operatorname{cov}(t_1,t_2) &= \int\cdots\int (t_1-\tau)(t_2-\tau)\,L\,dx_1\cdots dx_n \\ &\le \Bigl\{\int\cdots\int (t_1-\tau)^2 L\,dx_1\cdots dx_n \int\cdots\int (t_2-\tau)^2 L\,dx_1\cdots dx_n\Bigr\}^{1/2} \\ &= (\operatorname{var} t_1 \operatorname{var} t_2)^{1/2} = V. \end{aligned} \tag{17.55}$$

Thus (17.54) and (17.55) give

$$\operatorname{var} t_3 \le V,$$

which contradicts the assumption that $t_1$ and $t_2$ have MV unless the equality holds.
This implies that the equality sign holds in (17.55). This can only be so if

$$(t_1-\tau) = k(\theta)(t_2-\tau), \tag{17.56}$$

i.e. the variables are proportional. But if this is so, we have from (17.56)

$$\operatorname{cov}(t_1,t_2) = k(\theta)\operatorname{var} t_2 = k(\theta)V$$

and this equals $V$ since the equality sign holds in (17.55). Thus

$$k(\theta) = 1$$

and hence, from (17.56), $t_1 = t_2$ identically. Thus a MV estimator is unique.


17.27 The argument of the preceding section can be generalized to give an
interesting inequality for the correlation $\rho$ between any two estimators, which, it will be
remembered from 16.23, is defined as the ratio of their covariance to the square root
of the product of their variances. By the argument leading to (17.55), we see that
$\rho^2 \le 1$ always.

Suppose that $t$ is the MV unbiassed estimator of $\tau(\theta)$, with variance $V$, and that
$t_1$ and $t_2$ are any two other unbiassed estimators of $\tau(\theta)$. Consider a new estimator

$$t_3 = at_1 + (1-a)t_2.$$

This will also estimate $\tau(\theta)$, with variance

$$\operatorname{var} t_3 = a^2\operatorname{var} t_1 + (1-a)^2\operatorname{var} t_2 + 2a(1-a)\operatorname{cov}(t_1, t_2). \tag{17.57}$$

If we now write

$$\operatorname{var} t_1 = k_1V, \qquad \operatorname{var} t_2 = k_2V, \qquad k_1, k_2 > 1,$$

we obtain, from (17.57),

$$\operatorname{var} t_3/V = a^2k_1 + (1-a)^2k_2 + 2a(1-a)\operatorname{cov}(t_1, t_2)/V, \tag{17.58}$$

and writing

$$\rho = \frac{\operatorname{cov}(t_1,t_2)}{(\operatorname{var} t_1\operatorname{var} t_2)^{1/2}} = \frac{\operatorname{cov}(t_1,t_2)}{(k_1k_2)^{1/2}V}$$

we obtain from (17.58)

$$\operatorname{var} t_3/V = a^2k_1 + (1-a)^2k_2 + 2a(1-a)\rho(k_1k_2)^{1/2},$$

and since var $t_3 \ge V$, this becomes

$$a^2k_1 + (1-a)^2k_2 + 2a(1-a)\rho(k_1k_2)^{1/2} \ge 1. \tag{17.59}$$

(17.59) may be rearranged as a quadratic inequality in $a$,

$$a^2\{k_1+k_2-2\rho(k_1k_2)^{1/2}\} + 2a\{\rho(k_1k_2)^{1/2}-k_2\} + (k_2-1) \ge 0,$$

and the discriminant of the left-hand side cannot be positive, since the roots of the
equation are complex or equal. Thus

$$\{\rho(k_1k_2)^{1/2}-k_2\}^2 \le \{k_1+k_2-2\rho(k_1k_2)^{1/2}\}(k_2-1)$$

which yields $\{\rho(k_1k_2)^{1/2}-1\}^2 \le (k_1-1)(k_2-1)$.
Hence

$$\frac{1}{(k_1k_2)^{1/2}} - \Bigl\{\frac{(k_1-1)(k_2-1)}{k_1k_2}\Bigr\}^{1/2} \le \rho \le \frac{1}{(k_1k_2)^{1/2}} + \Bigl\{\frac{(k_1-1)(k_2-1)}{k_1k_2}\Bigr\}^{1/2}$$

or, finally, writing $E_1 = 1/k_1$, $E_2 = 1/k_2$,

$$(E_1E_2)^{1/2} - \{(1-E_1)(1-E_2)\}^{1/2} \le \rho \le (E_1E_2)^{1/2} + \{(1-E_1)(1-E_2)\}^{1/2}. \tag{17.60}$$

If either $E_1$ or $E_2 = 1$, i.e. either $t_1$ or $t_2$ is the MV estimator of $\tau(\theta)$, (17.60) collapses
into the equality

$$\rho = E^{1/2}, \tag{17.61}$$

where $E$ is the reciprocal of the relative variance of the other estimator, generally called
its efficiency in large samples, for a reason to be discussed in 17.28-9.
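
The limits in (17.60) are easily tabulated. The following is a minimal sketch (the function name is hypothetical) which returns the lower and upper limits for $\rho$ given the two reciprocal relative variances $E_1$ and $E_2$.

```python
from math import sqrt


def correlation_bounds(e1, e2):
    """Lower and upper limits for rho in (17.60), for e1, e2 in (0, 1]."""
    centre = sqrt(e1 * e2)
    spread = sqrt((1.0 - e1) * (1.0 - e2))
    return centre - spread, centre + spread


print(correlation_bounds(0.5, 0.5))    # two estimators, each with E = 1/2
print(correlation_bounds(1.0, 0.64))   # one of the two is the MV estimator
```

When either argument is unity the spread vanishes and both limits equal the square root of the other argument, in agreement with (17.61).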


Efficiency 

17.28 So far, our discussion of MV estimation has been exact, in the sense that
it has not restricted sample size in any way. We now turn to consideration of
large-sample properties. Even if there is no MV estimator for each value of $n$, there will
often be one as $n$ tends to infinity. Since most of the estimators we deal with are
asymptotically normally distributed in virtue of the Central Limit theorem, the
distribution of such an estimator will depend for large samples on only two parameters: its
mean value and its variance. If it is a consistent estimator it will commonly be
asymptotically unbiassed (cf. 17.9). This leaves the variance as the means of discriminating
between consistent, asymptotically normal estimators of the same parametric function.

Among such estimators, that with MV in large samples(*) is called an efficient
estimator, or simply efficient, the term being due to Fisher (1921a). It follows from the
result of 17.26 that efficient estimators tend asymptotically to equivalence.


17.29 Since we generally compare asymptotically normal estimators in large
samples, we can reasonably set up a measure of efficiency, something we have not
attempted to do for small $n$ and arbitrarily-distributed estimators. We shall define
the efficiency of any other estimator, relative to the efficient estimator, as the reciprocal
of the ratio of sample numbers required to give the estimators equal sampling variances,
i.e. to make them equally precise.


(*) But see the discussion of "superefficiency" in 18.17.

If the efficient estimator is $t_1$, and $t_2$ is another estimator, and (as is generally the
case) the variances are, in large samples, simple inverse functions of sample size, we
may translate our definition of efficiency into a simple form. Let us suppose that

$$\begin{aligned} V_1 = \operatorname{var}(t_1\,|\,n_1) &\sim a_1/n_1^{\,r} \quad (r > 0), \\ V_2 = \operatorname{var}(t_2\,|\,n_2) &\sim a_2/n_2^{\,s} \quad (s > 0), \end{aligned} \tag{17.62}$$

where $a_1, a_2$ are constants independent of $n$, and we have shown sample size as an
argument in the variances. If we are to have $V_1 = V_2$, we must have

$$1 = \frac{V_1}{V_2} = \lim \frac{a_1}{a_2}\cdot\frac{n_2^{\,s}}{n_1^{\,r}}.$$

Thus

$$\frac{a_2}{a_1} = \lim \frac{n_2^{\,s}}{n_1^{\,r}} = \lim \Bigl(\frac{n_2}{n_1}\Bigr)^{r}\cdot\frac{1}{n_2^{\,r-s}}. \tag{17.63}$$

If $t_1$ is efficient, we must have $r \ge s$. If $r > s$, the last factor on the right of (17.63)
will tend to zero, and hence, if the product is to remain equal to $a_2/a_1$, we must have

$$\frac{n_2}{n_1} \to \infty, \qquad r > s, \tag{17.64}$$

and we would thus say that $t_2$ has zero efficiency. If, in (17.63), $r = s$, we have at once

$$\lim \frac{n_2}{n_1} = \Bigl(\frac{a_2}{a_1}\Bigr)^{1/r}$$

which from (17.62) may be written

$$\lim \frac{n_2}{n_1} = \lim \Bigl(\frac{V_2}{V_1}\Bigr)^{1/r}$$

and the efficiency of $t_2$ is the reciprocal of this, namely

$$E = \lim \Bigl(\frac{V_1}{V_2}\Bigr)^{1/r}. \tag{17.65}$$

Note that if $r > s$, (17.65) gives the same result as (17.64). If $r = 1$, which is the most
common case, (17.65) reduces to the inverse variance-ratio encountered at the end of
17.27. Thus, when we are comparing estimators with variances of order $1/n$, we
measure efficiency relative to the efficient estimator by the inverse of the variance-ratio.
If the variance of the efficient estimator is not of the simple form (17.62), the
measurement of relative efficiency is not so simple. Cf. Exercise 18.21.


Example 17.12 


We saw in Example 17.6 that the sample mean is a MVB estimator of the mean $\mu$
of a normal population, with variance $\sigma^2/n$. A fortiori, it is the efficient estimator.
We saw in Example 11.12 that it is exactly normally distributed. In 17.11-12, we
saw that the sample median is asymptotically normal with mean $\mu$ and variance $\pi\sigma^2/(2n)$.
Thus, from (17.65) with $r = 1$, the efficiency of the sample median is $2/\pi = 0.637$.


Example 17.13 


Other things being equal, the estimator with the greater efficiency is undoubtedly
the one to use. But sometimes other things are not equal. It may, and does, happen
that an efficient estimator $t_1$ is more troublesome to calculate than an alternative $t_2$.
The extra labour involved in calculation may be greater than the saving in dealing
with a smaller sample number, particularly if there are plenty of further observations
to hand.

Consider the estimation of the standard deviation of a normal population with
variance $\sigma^2$ and unknown mean. Two possible estimators are the standard deviation
of the sample and the mean deviation of the sample multiplied by $(\pi/2)^{1/2}$ (cf. 5.26).
The latter is easier to calculate, as a rule, and if we have plenty of observations (as, for
example, if we are finding the standard deviation of a set of barometric records and
the addition of further members to the sample is merely a matter of turning up more
records) it may be worth while estimating from the mean deviation rather than from
the standard deviation. Both estimators are asymptotically normally distributed.

In large samples the variance of the mean deviation is (cf. (10.39)) $\dfrac{\sigma^2}{n}\Bigl(1-\dfrac{2}{\pi}\Bigr)$.
The variance of the estimator of $\sigma$ from the mean deviation is then approximately

$$\frac{\pi}{2}\cdot\frac{\sigma^2}{n}\Bigl(1-\frac{2}{\pi}\Bigr) = \frac{\sigma^2}{2n}(\pi-2).$$

Now the variance of the standard deviation (cf. 10.8(d)) is $\dfrac{\sigma^2}{2n}$, and we shall see
later that it is an efficient estimator. Thus the efficiency of the first estimator is

$$E = \frac{\sigma^2}{2n}\Big/\Bigl\{\frac{\sigma^2}{2n}(\pi-2)\Bigr\} = \frac{1}{\pi-2} = 0.876.$$

The accuracy of the estimate from the mean deviation of a sample of 1000 is then
about the same as that from the standard deviation of a sample of 876. If it is easier
to calculate the m.d. of 1000 observations than the s.d. of 876 and there is no shortage
of observations, it may be more convenient to use the former.

It has to be remembered, nevertheless, that in adopting such a procedure we are
deliberately wasting information. By taking greater pains we could improve the
efficiency of our estimate from 0.876 to unity, or by about 14 per cent of the former
value.
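
The two efficiencies found in Examples 17.12 and 17.13 follow directly from (17.65). The short sketch below (the function name and the numerical values chosen for $\sigma^2$ and $n$ are illustrative assumptions) evaluates the variance-ratio for the sample median and for the mean-deviation estimator of $\sigma$.

```python
from math import pi


def relative_efficiency(v_efficient, v_other, r=1.0):
    """Efficiency (17.65): (V1/V2)^(1/r), both variances taken at the same n."""
    return (v_efficient / v_other) ** (1.0 / r)


sigma2, n = 1.0, 1000            # arbitrary illustrative values; they cancel
median_eff = relative_efficiency(sigma2 / n, pi * sigma2 / (2 * n))
mean_dev_eff = relative_efficiency(sigma2 / (2 * n), sigma2 * (pi - 2) / (2 * n))
print(round(median_eff, 3), round(mean_dev_eff, 3))
```

Since both variances are of order $1/n$, the sample size and $\sigma^2$ cancel and only the constants $a_1$, $a_2$ of (17.62) matter.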


Minimum mean-square-error estimation 


17.30 Our discussions of unbiassedness and the minimization of sampling vari- 
ance have been conducted more or less independently. Sometimes, however, it is 
relevant to investigate both questions simultaneously. It is reasonable to argue that 
the presence of bias should not necessarily outweigh small sampling variance in an 
estimator. What we are really demanding of an estimator $t$ is that it should be "close"
to the true value $\theta$. Let us, therefore, consider its mean-square-error about that true
value, instead of its mean-square-error about its own expected value. We have at once

$$E(t-\theta)^2 = E\{(t-E(t)) + (E(t)-\theta)\}^2 = \operatorname{var} t + \{E(t)-\theta\}^2,$$

the cross-product term on the right being equal to zero. The last term on the right
is simply the square of the bias of $t$ in estimating $\theta$. If $t$ is unbiassed, this last term
is zero, and mean-square-error becomes variance. In general, however, the minimiza- 
tion of mean-square-error gives different results. 




Example 17.14 


What multiple $a$ of the sample mean $\bar{x}$ estimates the population mean $\mu$ with
minimum mean-square-error? We have, from previous results, $E(a\bar{x}) = a\mu$,
$\operatorname{var}(a\bar{x}) = a^2\sigma^2/n$, where $\sigma^2$ is the population variance and $n$ is sample size, and thus we have

$$E(a\bar{x}-\mu)^2 = a^2\sigma^2/n + \mu^2(a-1)^2.$$

For variation in $a$, this is minimized when

$$2a\sigma^2/n + 2\mu^2(a-1) = 0,$$

i.e. when

$$a = \frac{\mu^2}{\mu^2+\sigma^2/n}.$$

As $n \to \infty$, $a \to 1$, and we choose the unbiassed estimator, but for any finite $n$, $a < 1$.

If there is some known functional relation between $\mu$ and $\sigma^2$, we can take the matter
further. For example, if $\sigma^2 = \mu^2$, we obtain simply $a = n/(n+1)$.

Evidently considerations of this kind will only be of use in determining estimators
when something is known of the relation between the parameters of the distribution
from which we are sampling. Minimum mean-square-error estimators are not much
used, but it is as well to recognize that the objection to them is a practical, rather than
a theoretical one. In a sense, MV unbiassed estimators are tractable because they
assume away the difficulty by insisting on unbiassedness.
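
The minimizing multiple of Example 17.14 can be evaluated exactly for given parameter values. The sketch below is illustrative only: the function name and the numerical values are assumptions. It returns $a = \mu^2/(\mu^2+\sigma^2/n)$ together with the resulting mean-square-error, for comparison with the variance $\sigma^2/n$ of the unbiassed estimator $\bar{x}$.

```python
from fractions import Fraction


def best_multiple(mu, sigma2, n):
    """Multiplier a minimizing E(a*xbar - mu)^2, with the resulting mean-square-error."""
    mu, sigma2 = Fraction(mu), Fraction(sigma2)
    a = mu**2 / (mu**2 + sigma2 / n)
    mse = a**2 * sigma2 / n + mu**2 * (a - 1)**2
    return a, mse


a, mse = best_multiple(mu=2, sigma2=4, n=9)   # sigma^2 = mu^2, so a = n/(n+1)
print(a, mse, Fraction(4, 9))                 # last value is sigma^2/n
```

Here $a = 9/10$ and the mean-square-error is $2/5$, against $4/9$ for the sample mean itself. The gain depends on knowing the relation between $\mu$ and $\sigma^2$, as the text remarks.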


Sufficient statistics 

17.31 The criteria of estimation which we have so far discussed, namely con- 
sistency, unbiassedness, minimum variance and efficiency, are reasonable guides in 
assessing the properties of an estimator. To permit a more fundamental discussion, we
now introduce the concept of sufficiency, which is due to Fisher (1921a, 1925).

Consider first the estimation of a single parameter $\theta$. There is an unlimited number
of possible estimators of $\theta$, from among which we must choose. With a sample of $n \ge 2$
observations as before, consider the joint distribution of a set of $r$ functionally
independent statistics, $f_r(t, t_1, t_2, \ldots, t_{r-1}\,|\,\theta)$, $r = 2, 3, \ldots, n$, where we have selected
the statistic $t$ for special consideration. Using the multiplication theorem of
probability (7.9), we may write this as the product of the marginal distribution of $t$ and the
conditional distribution of the other statistics given $t$, i.e.

$$f_r(t, t_1, \ldots, t_{r-1}\,|\,\theta) = g(t\,|\,\theta)\,h_{r-1}(t_1, \ldots, t_{r-1}\,|\,t, \theta). \tag{17.66}$$

Now if the last factor on the right of (17.66) is independent of $\theta$, we clearly have a
situation in which, given $t$, the set $t_1, \ldots, t_{r-1}$ contribute nothing further to our
knowledge of $\theta$. If, further, this is true for every $r$ and any set of $(r-1)$ statistics $t_i$, we
may fairly say that $t$ contains all the information in the sample about $\theta$, and we
therefore call it a sufficient statistic for $\theta$. We thus formally define $t$ as sufficient for $\theta$ if
and only if

$$f_r(t, t_1, \ldots, t_{r-1}\,|\,\theta) = g(t\,|\,\theta)\,h_{r-1}(t_1, \ldots, t_{r-1}\,|\,t), \tag{17.67}$$

where $h_{r-1}$ is independent of $\theta$, for $r = 2, 3, \ldots, n$ and any choice of $t_1, \ldots, t_{r-1}$.(*)


(*) This definition is usually given only for $r = 2$, but the definition for all $r$ seems to us more
natural. It adds no further restriction to the concept of sufficiency.

17.32 As it stands, the definition of (17.67) does not enable us to see whether, in
any given situation, a sufficient statistic exists. However, we may reduce it to a
condition on the Likelihood Function. For if the latter may be written

$$L(x_1, \ldots, x_n\,|\,\theta) = g(t\,|\,\theta)\,k(x_1, \ldots, x_n), \tag{17.68}$$

where $g(t\,|\,\theta)$ is a function of $t$ and $\theta$ alone(*) and $k$ is independent of $\theta$, it is easy to
see that (17.67) is deducible from (17.68). For any fixed $r$, and any set of $t_1, \ldots, t_{r-1}$,
insert the differential elements $dx_1 \ldots dx_n$ on both sides of (17.68) and make the
transformation

$$\begin{aligned} t &= t(x_1, \ldots, x_n), \\ t_i &= t_i(x_1, \ldots, x_n), \qquad i = 1, 2, \ldots, r-1, \\ t_i &= x_i, \qquad i = r, \ldots, n-1. \end{aligned}$$

The Jacobian of the transformation will not involve $\theta$, and (17.68) will be transformed to

$$g(t\,|\,\theta)\,dt\;l(t, t_1, \ldots, t_{n-1})\prod_{i=1}^{n-1} dt_i, \tag{17.69}$$

and if we now integrate out the redundant variables $t_r, \ldots, t_{n-1}$, we obtain, for the
joint distribution of $t, t_1, \ldots, t_{r-1}$, precisely the form (17.67).

It should be noted that in performing the integration with respect to $t_r, \ldots, t_{n-1}$,
i.e. $x_r, \ldots, x_{n-1}$, we have assumed that no factor in $\theta$ was thereby introduced. This
is clearly so when the range of the distribution of the underlying variable is independent
of $\theta$; we shall see later (17.40-1) that these integrations remain independent of $\theta$ when
one terminal of the range depends on $\theta$, and also when both terminals depend on $\theta$.

The converse result is also easily established. In (17.67) with $r = n$, put $t_i = x_i$
($i = 1, 2, \ldots, n-1$). We then have

$$f_n(t, x_1, x_2, \ldots, x_{n-1}\,|\,\theta) = g(t\,|\,\theta)\,h_{n-1}(x_1, \ldots, x_{n-1}\,|\,t). \tag{17.70}$$

On inserting the differential elements $dt\,dx_1 \ldots dx_{n-1}$ on either side of (17.70), the
transformation

$$\begin{aligned} x_n &= x_n(t, x_1, \ldots, x_{n-1}), \\ x_i &= x_i, \qquad i = 1, 2, \ldots, n-1, \end{aligned}$$

applied to (17.70) yields (17.68) at once. Thus (17.67) is necessary and sufficient
for (17.68). This proof deals only with the case when the variates are continuous.
In the discrete case the argument simplifies, as the reader will find on retracing the
steps of the proof. A very general proof of the equivalence of (17.67) and (17.68) is
given by Halmos and Savage (1949).

We have discussed only the case $n \ge 2$. For $n = 1$ we take (17.68) as the definition
of sufficiency.


(*) We retain the notation $g(t\,|\,\theta)$, since the function of $t$ and $\theta$ may always be expressed as
the marginal distribution of $t$.

Sufficiency and minimum variance

17.33 The necessary and sufficient condition for sufficiency at (17.68) has one
immediate consequence of interest. On taking logarithms of both sides and
differentiating, we have

$$\frac{\partial\log L}{\partial\theta} = \frac{\partial\log g(t\,|\,\theta)}{\partial\theta}. \tag{17.71}$$

On comparing (17.71) with (17.27), the condition that a MVB estimator of $\tau(\theta)$ exist,
we see that such an estimator can only exist if there is a sufficient statistic. In fact,
(17.27) is simply the special case of (17.71) when

$$\frac{\partial\log g(t\,|\,\theta)}{\partial\theta} = A(\theta)\{t-\tau(\theta)\}. \tag{17.72}$$



Thus sufficiency, which perhaps at first sight seems a more restrictive criterion than 
the attainment of the MVB, is in reality a less restrictive one. For whenever (17.27) 
holds, (17.71) holds also, while even if (17.27) does not hold we may still have a sufficient 
statistic. 


Example 17.15 
The argument of 17.33 implies that in all the cases (Examples 17.6, 17.8, 17.9, 
17.10) where we have found MVB estimators to exist, they are also sufficient statistics. 


Example 17.16 
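Consider the estimation of $\theta$ in

$$dF(x) = dx/\theta, \qquad 0 \le x \le \theta.$$

The Likelihood Function (LF) is

$$L(x\,|\,\theta) = \begin{cases} \theta^{-n}, & x_{(n)} \le \theta, \\ 0 & \text{otherwise,} \end{cases}$$

since we know that all the observations, including the largest of them, $x_{(n)}$, cannot
exceed $\theta$.
We may write this in the form

$$L(x\,|\,\theta) = \theta^{-n}\,u(\theta - x_{(n)}), \tag{17.73}$$

where

$$u(z) = \begin{cases} 1, & z \ge 0, \\ 0, & z < 0. \end{cases}$$

(17.73) makes it clear at once that the LF can be factorized into a function of $x_{(n)}$ and
$\theta$ alone,

$$g(x_{(n)}\,|\,\theta) = \theta^{-n}\,u(\theta - x_{(n)}),$$

and a second factor

$$k(x_1, \ldots, x_n) = 1.$$

Thus, from (17.68), $x_{(n)}$ is a sufficient statistic for $\theta$.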
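
The factorization in Example 17.16 may be seen numerically: two samples of the same size with the same largest observation give identical likelihoods for every $\theta$. The following minimal sketch uses a hypothetical function name and made-up sample values.

```python
def uniform_likelihood(sample, theta):
    """Likelihood (17.73): theta^(-n) if max(sample) <= theta, else 0."""
    n = len(sample)
    return theta ** (-n) if max(sample) <= theta else 0.0


a = [0.2, 0.9, 0.5]
b = [0.9, 0.1, 0.1]      # same size and same largest observation as a
for theta in (0.8, 1.0, 2.5):
    print(theta, uniform_likelihood(a, theta), uniform_likelihood(b, theta))
```

The remaining observations enter only through the factor $k = 1$, so that nothing beyond $x_{(n)}$ (and $n$) affects the comparison of different values of $\theta$.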
17.34 The sufficient statistic, as defined by (17.68), is unique, except that if $t$ is
sufficient, any one-to-one function of $t$ will also be sufficient. For if we
take $t = t(u)$, we may write (17.68) as

$$L(x\,|\,\theta) = g(t\,|\,\theta)\,|t'(u)|\cdot\frac{k(x)}{|t'(u)|};$$

this may be rewritten

$$L(x\,|\,\theta) = g_1(u\,|\,\theta)\,k_1(x),$$

where $g_1(u\,|\,\theta)$ is the frequency function of $u$, and $k_1$ is independent of $\theta$. Thus $u$ is
also sufficient for $\theta$. Such ambiguities offer no difficulty in practice, for we simply
choose a function of $t$ which is a consistent estimator, and usually also an unbiassed
estimator, of $\theta$. Functions of $t$ which are not one-to-one may also be sufficient in
particular cases. Cf. Exercise 23.31.

Apart from this, the sufficient statistic is unique. For if there were two distinct
sufficient statistics, $t_1$ and $t_2$, we should have, from (17.67) with $r = 2$,

$$f_2(t_1, t_2\,|\,\theta) = g_1(t_1\,|\,\theta)\,h_1(t_2\,|\,t_1) = g_2(t_2\,|\,\theta)\,h_2(t_1\,|\,t_2), \tag{17.74}$$

so that we may write, from (17.74), the functional relationship

$$t_1 = k(t_2, \theta). \tag{17.75}$$

But $t_1$ and $t_2$ are functions of the observations only and not of $\theta$. Hence, from (17.75),
$t_1$ is functionally related to $t_2$.


17.35 We have seen in 17.33 that a sufficient statistic provides the MVB estimator,
where there is one. We now prove a more general result, due to C. R. Rao (1945) and
Blackwell (1947), that irrespective of the attainability of any variance bound, the MV
unbiassed estimator of $\tau(\theta)$, if one exists, is always a function of the sufficient statistic.

Let $t$ be sufficient for $\theta$, and $t_1$ another statistic with finite variance and

$$E(t_1) = \tau(\theta). \tag{17.76}$$

Because of (17.68), we may write (17.76) as

$$\tau(\theta) = \int\cdots\int t_1\,g(t\,|\,\theta)\,k(x_1, \ldots, x_n)\,dx_1\cdots dx_n. \tag{17.77}$$

We transform the right-hand side of (17.77) to a new set of variables $t, x_2, x_3, \ldots, x_n$
and integrate out the last $(n-1)$ of these, obtaining

$$\tau(\theta) = \int p(t)\,g(t\,|\,\theta)\,dt. \tag{17.78}$$

(17.78) shows that there is a function $p(t)$ of the sufficient statistic which is unbiassed
for $\tau(\theta)$. Furthermore, from the derivation of (17.78) we see that $p(t) = E(t_1\,|\,t)$. Now

$$\begin{aligned} \operatorname{var} t_1 = E\{t_1-\tau(\theta)\}^2 &= E\{[t_1-p(t)] + [p(t)-\tau(\theta)]\}^2 \\ &= E\{t_1-p(t)\}^2 + E\{p(t)-\tau(\theta)\}^2, \end{aligned}$$

since $E\{t_1-p(t)\}\{p(t)-\tau(\theta)\} = 0$ on taking the conditional expectation given $t$.
Thus $\operatorname{var} t_1 = E\{t_1-p(t)\}^2 + \operatorname{var}\{p(t)\} \ge \operatorname{var}\{p(t)\}$,
and the equality holds if and only if $t_1 = p(t)$. Thus $p(t) = E(t_1\,|\,t)$ has smaller
variance than $t_1$. We shall see in 23.9-11 that usually there is only one function of
a sufficient statistic with any given expectation. Then, whatever $t_1$ we start from,
$E(t_1\,|\,t)$ must be the same, and it is the unique unbiassed estimator of $\tau(\theta)$.
It is not always easy to use this result constructively, since $E(t_1\,|\,t)$ may be difficult
to evaluate (Exercise 17.24 gives an important class of cases where explicit results can
be obtained), but it does assure us that an unbiassed estimator with finite variance
which is a function of a sufficient statistic is the unique MV estimator.
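
The conditional expectation $E(t_1\,|\,t)$ can be found by brute force in a small discrete case. The sketch below is an illustration under stated assumptions, not part of the original argument: it takes $n$ independent 0/1 trials with success probability $\theta$, the crude statistic $t_1 = x_1x_2$ (unbiassed for $\theta^2$), and the sufficient statistic $r = \sum x_i$. Given $r$, every arrangement of the successes is equally likely whatever $\theta$ may be, so the conditional expectation is a plain average over arrangements. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def conditional_expectation_x1x2(n):
    """E(x1*x2 | r) for n Bernoulli trials, by enumerating all 0/1 sequences."""
    totals = {r: [0, 0] for r in range(n + 1)}   # r -> [sum of x1*x2, count]
    for seq in product((0, 1), repeat=n):
        r = sum(seq)
        totals[r][0] += seq[0] * seq[1]
        totals[r][1] += 1
    return {r: Fraction(s, c) for r, (s, c) in totals.items()}


n = 5
cond = conditional_expectation_x1x2(n)
print(all(cond[r] == Fraction(r * (r - 1), n * (n - 1)) for r in range(n + 1)))
```

The enumeration reproduces $r(r-1)/\{n(n-1)\}$, the unbiassed estimator of $\theta^2$ used in Example 17.11, as the function $p(t)$ of the sufficient statistic.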


Distributions possessing sufficient statistics 

17.36 We now seek to define the class of distributions in which a sufficient statistic 
exists for a parameter. We first consider the case where the range of the variate does 
not depend on the parameter $\theta$. From (17.71) we have, if $t$ is sufficient for $\theta$ in a
sample of $n$ independent observations,

$$\frac{\partial\log L}{\partial\theta} = \sum_{j=1}^{n}\frac{\partial\log f(x_j\,|\,\theta)}{\partial\theta} = K(t, \theta), \tag{17.79}$$

where $K$ is some function of $t$ and $\theta$. Regarding this as an equation in $t$, we see that
it remains true for any particular value of $\theta$, say zero. It is then evident that $t$ must
be expressible in the form

$$t = M\Bigl\{\sum_{j=1}^{n} k(x_j)\Bigr\}, \tag{17.80}$$

where $M$ and $k$ are arbitrary functions. If $w = \sum k(x_j)$, then $K$ is a function of $\theta$
and $w$ only, say $N(\theta, w)$. We have then, from (17.79), if the derivatives exist,

$$\frac{\partial^2\log L}{\partial\theta\,\partial x_j} = \frac{\partial N}{\partial w}\cdot\frac{\partial w}{\partial x_j}. \tag{17.81}$$

Now the left-hand side of (17.81) is a function of $\theta$ and $x_j$ only and $\partial w/\partial x_j$ is a function
of $x_j$ only. Hence $\partial N/\partial w$ is a function of $\theta$ and $x_j$ only. But it must be symmetrical in
the $x$'s and hence is a function of $\theta$ only. Hence, integrating it with respect to $w$, we have

$$N(\theta, w) = w\,p(\theta) + q(\theta),$$

where $p$ and $q$ are arbitrary functions of $\theta$. Thus (17.79) becomes

$$\frac{\partial\log L}{\partial\theta} = \sum_j \frac{\partial\log f(x_j\,|\,\theta)}{\partial\theta} = p(\theta)\sum_j k(x_j) + q(\theta), \tag{17.82}$$

whence

$$\frac{\partial}{\partial\theta}\log f(x\,|\,\theta) = p(\theta)\,k(x) + q(\theta)/n,$$

giving the necessary condition for a sufficient statistic to exist,

$$f(x\,|\,\theta) = \exp\{A(\theta)B(x) + C(x) + D(\theta)\}. \tag{17.83}$$

This result, which is due to Darmois (1935), Pitman (1936) and Koopman (1936),
is precisely the form of the exponential family of distributions, obtained at (17.30)
as a condition for the existence of a MVB estimator for some function of $\theta$.
L. Brown (1964) gives a rigorous treatment of the regularity conditions sufficient for
this result to hold, with references to related work.

If (17.83) holds, it is easily verified that if the range of $f(x\,|\,\theta)$ is independent of $\theta$,
the Likelihood Function yields a sufficient statistic for $\theta$. Thus, under this condition,
(17.83) is sufficient for the distribution to possess a sufficient statistic.
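
A numerical illustration of the sufficiency of (17.83) may be helpful. The choice of the Poisson distribution here is an assumption of the sketch: it has the form (17.83) with $A(\theta) = \log\theta$, $B(x) = x$, $C(x) = -\log x!$ and $D(\theta) = -\theta$, so that $\sum x_j$ should be sufficient. The code (function name hypothetical, sample values made up) checks that two samples with the same total have log-likelihoods differing by a quantity free of $\theta$.

```python
from math import lgamma, log


def poisson_loglik(sample, theta):
    """Log-likelihood of independent Poisson(theta) counts."""
    return sum(x * log(theta) - theta - lgamma(x + 1) for x in sample)


a = [0, 3, 3]
b = [2, 2, 2]            # same size and same total as a
diffs = [poisson_loglik(a, th) - poisson_loglik(b, th) for th in (0.5, 1.0, 4.0)]
print(diffs)
```

The common difference is $\sum C(a_j) - \sum C(b_j)$, which corresponds to the factor $k(x)$ of (17.68) and carries no information about $\theta$.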

All the parent distributions of Example 17.15 are of the form (17.83).

17.37 Under regularity conditions, there is therefore a one-to-one correspondence
between the existence of a sufficient statistic for $\theta$ and the existence of a MVB estimator
of some function of $\theta$. If (17.83) holds, a sufficient statistic exists for $\theta$, and there
will be just one function, $t$, of that statistic (itself sufficient) which will satisfy (17.27)
and so estimate some function $\tau(\theta)$ with variance equal to the MVB. In large samples,
moreover (cf. 17.23), any function of the sufficient statistic will estimate its expected
value with MVB accuracy. Finally, for any $n$ (cf. 17.35), any function of the sufficient
statistic will have the minimum attainable variance in estimating its expected value.


Sufficient statistics for several parameters 


17.38 All the ideas of the previous sections generalize immediately to the case 
where the distribution is dependent upon several parameters 0,,...,0,. It also makes 
no difference to the essentials if we have a multivariate distribution, rather than a uni- 
variate one. Thus if we define each x, as a vector variate with p(> 1) components, 
t and t; as vectors of statistics, and 8 as a vector of parameters with k components, 
17.31 and 17.32 remain substantially unchanged. If we can write 

L(x|6) = g(t| 6) A(x) (17.84) 
we call the components of t a set of (jointly) sufficient statistics for 8. ‘The property 
(17.67) follows as before. 

If t has s components, we may have s greater than, equal to, or less than k. If 
s = 1, we may call t a single sufficient statistic. (Thus the term “ sufficient ” used 
earlier in this chapter should now be read “single sufficient”’.) If we put t= x 
we see that the observations themselves always constitute a set of sufficient statistics 
for @withs = n. Inorder to reduce the problem of analysing the data as far as possible, 
we naturally desire s to be as small as possible. Even this is not quite restrictive enough 
—see Exercise 18.13. In Chapter 23 we shall define the concept of a minimal set 
of sufficient statistics for 8, which is a function of all other sets of sufficient statistics. 

It evidently does not follow from the joint sufficiency of $t$ for $\theta$ that any particular
component of $t$, say $t^{(1)}$, is individually sufficient for $\theta_1$. This will only be so if $g(t \mid \theta)$
factorizes with $g_1(t^{(1)} \mid \theta_1)$ as one factor. Nor is the converse true always: individual
sufficiency of all the $t^{(i)}$ when the others are known does not imply joint sufficiency.

If $k = 1$, the result of 17.35 holds unchanged if $t$ is a vector with $s \leq n$ components
(if $s = n$, the result is an empty one); the result is most easily applied with $s$ as small
as possible.


Example 17.17 
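Consider the estimation of the parameters $\mu$ and $\sigma^2$ in

    $dF(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\tfrac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right\} dx, \quad -\infty < x < \infty.$

We have

    $L(x \mid \mu, \sigma^2) = \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\left\{-\tfrac{1}{2}\sum\left(\frac{x-\mu}{\sigma}\right)^2\right\}$    (17.85)

and we have seen (Example 11.7) that the joint distribution of $\bar{x}$ and $s^2$ in normal
samples is

    $g(\bar{x}, s^2 \mid \mu, \sigma^2) \propto \frac{1}{\sigma}\exp\left\{-\frac{n}{2\sigma^2}(\bar{x}-\mu)^2\right\} \cdot \frac{1}{\sigma^{n-1}}\, s^{n-3} \exp\left\{-\frac{n s^2}{2\sigma^2}\right\},$

so that, remembering that $\sum (x-\mu)^2 = n\{s^2 + (\bar{x}-\mu)^2\}$, we have

    $L(x \mid \mu, \sigma^2) = g(\bar{x}, s^2 \mid \mu, \sigma^2)\, k(x)$

and therefore $\bar{x}$ and $s^2$ are jointly sufficient for $\mu$ and $\sigma^2$. We have already seen (Examples
17.6, 17.15) that $\bar{x}$ is sufficient for $\mu$ when $\sigma^2$ is known and (Examples 17.10, 17.15) that
$\frac{1}{n}\sum (x-\mu)^2$ is sufficient for $\sigma^2$ when $\mu$ is known. It is easily seen directly from (17.85)
that $s^2$ is not sufficient for $\sigma^2$ alone.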
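
The joint sufficiency of the sample mean and $s^2$ can be checked numerically. The following is a minimal sketch, not part of the original text: the data values and the helper names (`normal_log_lik`, `mean_and_s2`, `match_moments`) are made up for illustration. Two different samples are forced to share the same mean and the same $s^2$ (divisor $n$), and the logarithm of (17.85) is then compared for several parameter values.

```python
import math

def normal_log_lik(xs, mu, sigma2):
    """Logarithm of (17.85): joint normal density of the sample xs."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - ss / (2 * sigma2)

def mean_and_s2(xs):
    """Sample mean and s^2 with divisor n, as used in the example."""
    n = len(xs)
    xbar = sum(xs) / n
    return xbar, sum((x - xbar) ** 2 for x in xs) / n

def match_moments(ys, xbar, s2):
    """Affinely rescale ys so that its mean and s^2 equal xbar and s2."""
    ybar, ys2 = mean_and_s2(ys)
    scale = math.sqrt(s2 / ys2)
    return [xbar + scale * (y - ybar) for y in ys]

# Hypothetical data: sample_b has a different shape from sample_a
# but is rescaled to share its mean and s^2.
sample_a = [0.3, 1.9, 2.2, 4.1, 5.0]
sample_b = match_moments([1.0, 1.0, 2.0, 7.0, 9.0], *mean_and_s2(sample_a))

for mu, sigma2 in [(0.0, 1.0), (2.5, 3.0), (-1.0, 0.5)]:
    la = normal_log_lik(sample_a, mu, sigma2)
    lb = normal_log_lik(sample_b, mu, sigma2)
    print(mu, sigma2, la, lb)   # the two columns agree up to rounding
```

Because the LF depends on the observations only through $\sum (x-\mu)^2 = n\{s^2 + (\bar{x}-\mu)^2\}$, the two log-likelihoods coincide for every choice of the parameters, although the samples themselves differ.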


17.39 The principal results for sufficient statistics generalize to the $k$-parameter
case in a natural way. The condition (generalizing (17.83)) for a distribution to possess
a set of $k$ jointly sufficient statistics for its $k$ parameters becomes, under similar
conditions of continuity and the existence of derivatives,

    $f(x \mid \theta) = \exp\left\{\sum_{j=1}^{k} A_j(\theta) B_j(x) + C(x) + D(\theta)\right\},$    (17.86)

a result due to Darmois (1935), Koopman (1936) and Pitman (1936). More general
results of this kind are given by Barankin and Maitra (1963). The result of 17.35 on
the unique MV properties of functions of a sufficient statistic finds its generalization in
a theorem due to C. R. Rao (1947): for the simultaneous estimation of $r\ (\leq k)$ functions
$\tau_i$ of the $k$ parameters $\theta_j$, the unbiassed functions of a minimal set of $k$ sufficient
statistics, say $t_i$, have the minimum attainable variances, and (if the range is independent
of the $\theta_j$) the (not necessarily attainable) lower bounds to their variances are given by

    $\operatorname{var} t_i \geq \sum_{j=1}^{k} \sum_{l=1}^{k} \frac{\partial \tau_i}{\partial \theta_j} \frac{\partial \tau_i}{\partial \theta_l}\, I^{-1}_{jl}, \quad i = 1, 2, \ldots, r,$    (17.87)

where the information matrix

    $(I_{jl}) = \left\{ E\left( \frac{1}{L}\frac{\partial L}{\partial \theta_j} \cdot \frac{1}{L}\frac{\partial L}{\partial \theta_l} \right) \right\}$    (17.88)

is to be inverted. (17.87), in fact, is a further generalization of (17.22) and (17.50),
its simplest cases. Like (17.22), (17.87) takes account only of terms of order $1/n$ in
the variance.
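
As an illustration that is not worked in the text at this point, the normal distribution of Example 17.17 can be written in the exponential form of the Darmois-Koopman-Pitman condition with $k = 2$. Expanding the square in the exponent gives the following identification; the labelling of the two terms as $j = 1, 2$ is an arbitrary choice made here.

```latex
f(x \mid \mu, \sigma^2)
  = \exp\Bigl\{ \frac{\mu}{\sigma^2}\,x \;-\; \frac{1}{2\sigma^2}\,x^2
      \;-\; \frac{\mu^2}{2\sigma^2} \;-\; \tfrac12 \log(2\pi\sigma^2) \Bigr\},
\qquad
\begin{aligned}
A_1 &= \frac{\mu}{\sigma^2}, & B_1(x) &= x, \\
A_2 &= -\frac{1}{2\sigma^2}, & B_2(x) &= x^2, \\
C(x) &= 0, & D &= -\frac{\mu^2}{2\sigma^2} - \tfrac12 \log(2\pi\sigma^2).
\end{aligned}
```

The corresponding statistics $\sum x$ and $\sum x^2$ are in one-to-one correspondence with the pair (sample mean, $s^2$) found to be jointly sufficient in Example 17.17.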


Sufficiency when the range depends on the parameter 

17.40 Now consider the situation when the range of the variable does depend
on $\theta$. We omit the trivial case $n = 1$. First, we take the case when only one terminal
of the range, say the lower terminal, depends on $\theta$. We then have the frequency
function $f(x \mid \theta)$, $a(\theta) \leq x \leq b$, where $a(\theta)$ is a monotone function of $\theta$, and $\theta$ is in some
non-degenerate interval.

Just as in Example 17.16, we have

    $L(x \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)\; u(x_{(1)} - a(\theta))$

where

    $u(z) = 1$ if $z \geq 0$,
    $u(z) = 0$ otherwise.

It is at once obvious from the Likelihood Function that the smallest observation $x_{(1)}$
cannot be factored away from $\theta$ in the $u$-function; thus, if there is a single sufficient
statistic, it must be $x_{(1)}$. But $x_{(1)}$ can only be sufficient if $f(x_i \mid \theta)$ can be factored into
a function of $x_i$ alone and a function of $\theta$ alone, i.e. if

    $f(x \mid \theta) = g(x)/h(\theta).$    (17.89)

Then and only then is

    $L(x \mid \theta) = \frac{u(x_{(1)} - a(\theta))}{\{h(\theta)\}^n} \prod_{i=1}^{n} g(x_i)$

of the form required at (17.68) for $x_{(1)}$ to be sufficient.

Evidently the same result will hold, with $x_{(n)}$ instead of $x_{(1)}$, if the range of the variate
is $a \leq x \leq b(\theta)$, where $b(\theta)$ is monotone in $\theta$; if and only if (17.89) holds, $x_{(n)}$ is singly
sufficient for $\theta$.

In these situations, Exercise 17.24 uses the result of 17.35 to establish an explicit
form for the unique MV unbiassed estimator of any function $\tau(\theta)$. The reader
should notice that there the problem is reparametrized so that the affected terminal
is at $\theta$ itself.


17.41 If both terminals of the range depend on $\theta$, we have $a(\theta) \leq x \leq b(\theta)$ and

    $L(x \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)\; u(x_{(1)} - a(\theta))\, u(b(\theta) - x_{(n)}).$

We see by exactly the same argument that the extreme observations $x_{(1)}$ and $x_{(n)}$ are
a pair of sufficient statistics for $\theta$ if and only if (17.89) holds. We now consider whether
there can be a single sufficient statistic in this case; if there is it must clearly be a function
of $x_{(1)}$ and $x_{(n)}$.

Essentially, we are asking whether we can find a single statistic which will tell us
whether the product $u(x_{(1)} - a(\theta))\, u(b(\theta) - x_{(n)})$ in $L(x \mid \theta)$ is equal to 1 or 0. It is only
equal to 1 if both

    $x_{(1)} \geq a(\theta), \quad b(\theta) \geq x_{(n)}.$    (17.90)

There are four possibilities:

(i) $a(\theta)$, $b(\theta)$ are both increasing functions of $\theta$. (17.90) becomes

    $a^{-1}(x_{(1)}) \geq \theta \geq b^{-1}(x_{(n)}).$

(ii) They are both decreasing functions of $\theta$. (17.90) becomes

    $b^{-1}(x_{(n)}) \geq \theta \geq a^{-1}(x_{(1)}).$

(iii) $a(\theta)$ is increasing, $b(\theta)$ decreasing. (17.90) becomes

    $a^{-1}(x_{(1)}) \geq \theta, \quad b^{-1}(x_{(n)}) \geq \theta.$    (17.91)

(iv) $a(\theta)$ is decreasing, $b(\theta)$ increasing. (17.90) becomes

    $a^{-1}(x_{(1)}) \leq \theta, \quad b^{-1}(x_{(n)}) \leq \theta.$    (17.92)

In cases (i) and (ii), we need both $x_{(1)}$ and $x_{(n)}$, and no single sufficient statistic exists.
But (17.91) shows that in case (iii), $u(t_1 - \theta)$ is equivalent to the product of the original
$u$-functions, where

    $t_1 = \min\{a^{-1}(x_{(1)}),\, b^{-1}(x_{(n)})\};$    (17.93)

so $t_1$ is singly sufficient for $\theta$. Similarly, in case (iv),

    $t_2 = \max\{a^{-1}(x_{(1)}),\, b^{-1}(x_{(n)})\}$    (17.94)

is singly sufficient since $u(\theta - t_2)$ will do. We may summarize by saying that if the
upper terminal is a monotone decreasing function of the lower terminal and (17.89)
holds, there is a single sufficient statistic, given by $t_1$ or by $t_2$ according as the lower
terminal is an increasing or a decreasing function of $\theta$. Exercise 20.13 gives the
distribution of $t_1$, from which that of $t_2$ is immediately obtainable.

These results were originally due to Pitman (1936) and Davis (1951). The condition
that $\theta$ lies in a non-degenerate interval is important (cf. Example 17.23).


Example 17.18 
For the rectangular distribution

    $dF = dx/(2\theta), \quad -\theta \leq x \leq \theta,$

we are in case (iv) of 17.41. The single sufficient statistic (17.94) is

    $t_2 = \max\{-x_{(1)},\, x_{(n)}\}$

and since $x_{(1)} \leq x_{(n)}$ this is the same as

    $t_2 = \max\{|x_{(1)}|,\, |x_{(n)}|\},$

which is intuitively acceptable.
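
A small numerical check of this example may help; it is a sketch with made-up data, and the function names `rect_likelihood` and `t2` are illustrative only. Two different samples with the same value of $\max |x_i|$ give the same Likelihood Function at every value of the parameter.

```python
def rect_likelihood(xs, theta):
    """LF of n observations from dF = dx/(2*theta), -theta <= x <= theta."""
    if theta <= 0 or min(xs) < -theta or max(xs) > theta:
        return 0.0
    return (2.0 * theta) ** (-len(xs))

def t2(xs):
    """The single sufficient statistic max{-x_(1), x_(n)} = max |x_i|."""
    return max(-min(xs), max(xs))

# Hypothetical samples: different observations, same t2 = 1.3.
sample_a = [-0.7, 0.1, 0.4, -1.3, 0.9]
sample_b = [1.3, -0.2, 0.0, 0.5, -1.1]

for theta in [0.5, 1.0, 1.3, 1.5, 4.0]:
    print(theta, rect_likelihood(sample_a, theta), rect_likelihood(sample_b, theta))
```

The LF is zero for every parameter value below $t_2$ and equal to $(2\theta)^{-n}$ from $t_2$ onwards, so it is a function of the sample through $t_2$ alone.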


Example 17.19 
The distribution $dF(x) = \exp\{-(x-\alpha)\}\,dx$, $\alpha \leq x < \infty$, is of the form (17.89),
since it may be written

    $f(x) = \exp(-x)/\exp(-\alpha).$

Here the smallest observation, $x_{(1)}$, is sufficient for the lower terminal $\alpha$.


Example 17.20
The distribution

    $dF(x) \propto \exp(-x\theta)\,dx, \quad 0 \leq x \leq \theta,$

evidently cannot be put in the form (17.89). Thus there is no single sufficient statistic
for $\theta$ when $n \geq 2$.


Example 17.21 
In the two-parameter distribution

    $dF(x) = dx/(\beta - \alpha), \quad \alpha \leq x \leq \beta,$

it is clear that, given $\beta$, $x_{(1)}$ is sufficient for $\alpha$; and given $\alpha$, $x_{(n)}$ is sufficient for $\beta$. Also
$x_{(1)}$ and $x_{(n)}$ are a set of jointly sufficient statistics for $\alpha$ and $\beta$. This is confirmed by
observing that the joint distribution of $x_{(1)}$ and $x_{(n)}$ is, by (14.2) with $r = 1$, $s = n$,

    $g(x_{(1)}, x_{(n)}) = n(n-1)(x_{(n)} - x_{(1)})^{n-2}/(\beta - \alpha)^n,$

so that we may write

    $L(x \mid \alpha, \beta) = (\beta - \alpha)^{-n} = g(x_{(1)}, x_{(n)})\, k(x).$


Example 17.22
The rectangular distribution $dF(x) = dx/\theta$, $k\theta \leq x \leq (k+1)\theta$; $k > 0$ comes
under case (i) of 17.41. No single sufficient statistic exists, but $(x_{(1)}, x_{(n)})$ are a sufficient
pair since (17.89) holds.


Example 17.23 

The rectangular distribution $dF = dx$, $\theta \leq x \leq \theta + 1$; $\theta = 0, 1, 2, \ldots$, in which
$\theta$ is confined to integer values, does not satisfy the condition that the upper terminal
be a monotone decreasing function of the lower. But, evidently, any single observation,
$x_i$, in a sample of $n$ is a single sufficient statistic for $\theta$. In fact, $[x_i]$ (the integral part of $x_i$)
estimates $\theta$ with zero variance. If the integer restriction on $\theta$ is removed, no single sufficient
statistic exists, in accordance with 17.41.
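
The zero-variance claim is easy to see numerically. The sketch below is illustrative only: the true value 7, the sample size and the seed are arbitrary assumptions. Every observation lies in the unit interval starting at the integer parameter, so its integral part recovers the parameter.

```python
import math
import random

rng = random.Random(0)      # fixed seed so the run is repeatable
theta = 7                   # assumed true integer value of the parameter

# rng.random() lies in [0, 1), so each observation lies in [theta, theta + 1).
sample = [theta + rng.random() for _ in range(10)]

# Integral part of every single observation.
estimates = {math.floor(x) for x in sample}
print(estimates)
```

All ten observations give the same estimate, which is why a single observation already carries everything the sample has to say about the parameter in this degenerate case.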


17.42 We have now concluded our discussion of the basic ideas of the theory of 
estimation. In later chapters (23-4) we shall be developing the theory of sufficient 
statistics further. Meanwhile, we continue our study of estimation from another point 
of view, by studying the properties of the estimators given by the Method of Maximum 
Likelihood. 


EXERCISES 
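17.1 Show that in samples from

    $dF = \frac{1}{\Gamma(p)\,\theta^{p}}\, x^{p-1} e^{-x/\theta}\,dx, \quad p > 0;\ 0 \leq x < \infty,$

the MVB estimator of $\theta$ for fixed $p$ is $\bar{x}/p$, with variance $\theta^2/(np)$, while if $\theta = 1$ that of
$\frac{\partial}{\partial p}\log\Gamma(p)$ is $\frac{1}{n}\sum \log x_i$ with variance $\left\{\frac{\partial^2 \log\Gamma(p)}{\partial p^2}\right\}\big/ n$.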


17.2 A random variable $x$ has f.f.

    $f(x \mid \theta) = \theta f_1(x) + (1-\theta) f_2(x)$

where $0 < \theta < 1$ and $f_1$, $f_2$ are completely specified f.f.'s whose ranges of variation do not
involve $\theta$. Show that the MVB in estimating $\theta$ from a sample of $n$ observations is

    $\frac{\theta(1-\theta)}{n} \left\{ 1 - \int \frac{f_1(x) f_2(x)}{f(x \mid \theta)}\,dx \right\}^{-1},$

reducing to the binomial result (Example 17.9) when the ranges of $f_1$ and $f_2$ do not overlap.
(Hill, 1963b)


17.3 Evaluate $J_{22} = E\left\{\left(\frac{1}{L}\frac{\partial^2 L}{\partial\theta^2}\right)^2\right\}$ for the binomial distribution in Example 17.11,
and use it to evaluate (17.45) exactly. Compare the result with the exact variance of $t$
when $\theta = \tfrac{1}{2}$, as given in the Example.


17.4 Writing (17.27) as

    $(t - \tau) = \frac{\tau'(\theta)}{J_{11}} \cdot \frac{L'}{L}$

and the characteristic function of $t$ about its mean as $\phi(z)$ $(z = iu)$, show that for a MVB
estimator

    $\frac{\partial \phi(z)}{\partial z} = \frac{\tau'(\theta)}{J_{11}} \left\{ \frac{\partial \phi}{\partial \theta} + z\,\tau'(\theta)\,\phi \right\}$

and that its cumulants are given by

    $\kappa_{r+1} = \frac{\tau'(\theta)}{J_{11}} \frac{\partial \kappa_r}{\partial \theta}, \quad r = 2, 3, \ldots$    (A)

Hence show that the covariance between $t$ and an unbiassed estimator of its $r$th cumulant
is equal to its $(r+1)$th cumulant.

Establish the inequality (17.50) for an estimated function of several parameters, and
show that (A) holds in this case also when the bound is attained.
(Bhattacharyya, 1946-7)


17.5 Show that for the estimation of $\theta$ in the logistic distribution

    $f(x) = e^{-(x-\theta)} \{1 + e^{-(x-\theta)}\}^{-2}, \quad -\infty < x < \infty,$

the MVB is exactly $3/n$, whereas the sample mean has exact variance $\pi^2/(3n)$ and the
sample median asymptotic variance $4/n$.


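17.6 Show that in estimating $\sigma$ in the distribution

    $dF = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{x^2}{2\sigma^2}\right) dx, \quad -\infty < x < \infty,$

    $t_1 = \left(\tfrac{1}{2}\sum x^2\right)^{1/2} \Gamma\left(\tfrac{1}{2}n\right) \big/ \Gamma\left\{\tfrac{1}{2}(n+1)\right\}$

and

    $t_2 = \left\{\tfrac{1}{2}\sum (x-\bar{x})^2\right\}^{1/2} \Gamma\left\{\tfrac{1}{2}(n-1)\right\} \big/ \Gamma\left(\tfrac{1}{2}n\right)$

are both unbiassed. Show that (17.53) generally gives a greater bound than (17.24),
which gives $\sigma^2/(2n)$, but, by considering the case $n = 2$, that even this greater bound is
not attained for small $n$ by $t_1$.

(Chapman and Robbins, 1951)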


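17.7 In estimating $\mu^2$ in

    $dF = \frac{1}{\sqrt{2\pi}} \exp\left\{-\tfrac{1}{2}(x-\mu)^2\right\} dx,$

show that $(\bar{x}^2 - 1/n - \mu^2)$ is a linear function of $\frac{1}{L}\frac{\partial L}{\partial \mu}$ and $\frac{1}{L}\frac{\partial^2 L}{\partial \mu^2}$, and hence, from
(17.42), that $\bar{x}^2 - 1/n$ is an unbiassed estimator of $\mu^2$ with minimum attainable variance.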


17.8 Show by direct consideration of the Likelihood Functions that, in the cases 
considered in Examples 17.8 and 17.9, sufficient statistics exist. 


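17.9 For the three-parameter distribution

    $dF(x) = \frac{1}{\Gamma(p)} \left(\frac{x-\alpha}{\sigma}\right)^{p-1} \exp\left\{-\left(\frac{x-\alpha}{\sigma}\right)\right\} \frac{dx}{\sigma}, \quad p, \sigma > 0;\ \alpha \leq x < \infty,$

show from (17.86) that there are sufficient statistics for $p$ and $\sigma$ individually when the
other two parameters are known; and that there are sufficient statistics for $p$ and $\sigma$
jointly if $\alpha$ is known; and that if $\sigma$ is known and $p = 1$, there is a sufficient statistic for $\alpha$.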


17.10 Establish the bound (17.87) for the variance of one estimator in a set of 
estimators of several parametric functions. 


17.11 Show that if $t_1$ is the MV unbiassed estimator, and $t_2$ another unbiassed
estimator, of $\theta$, the covariance of $t_1$ and $(t_2 - t_1)$ is zero. Hence show that we may regard the
variation of $(t_2 - \theta)$ as composed of two parts, one being the variation of $(t_1 - \theta)$ and the
other a component due to inefficiency of estimation.
(cf. Fisher, 1925)


17.12 In the binomial distribution of Example 17.9, let $a$ and $b$ be integers or zero
and define $\tau_{ab} = \theta^a (1-\theta)^b$. Show that if $a, b \geq 0$ and $a + b \leq n$, $\tau_{ab}$ has the unbiassed
estimator $r^{(a)} (n-r)^{(b)} / n^{(a+b)}$, where $r^{(a)} = r(r-1)\cdots(r-a+1)$, but that otherwise no
unbiassed estimator of $\tau_{ab}$ exists.


17.13 For a sample of $n$ observations from a distribution of form (17.89), with
$a \leq x \leq \theta$, show that the sufficient statistic $x_{(n)}$ is not an unbiassed estimator of $\theta$.
Using the method of 17.10, show that $t' = 2x_{(n)} - x_{(n-1)}$ is unbiassed to order $n^{-1}$ and
has the same mean-square-error as $x_{(n)}$ to order $n^{-2}$, namely $2\left\{\frac{h(\theta)}{n\,g(\theta)}\right\}^2$. Show further
that $t'' = 3(x_{(n)} - x_{(n-1)}) + x_{(n-2)}$.
(cf. Robson and Whitlock, 1964)


17.14 For a sample of $n$ observations from a distribution with frequency function
(17.86), and the range of the variates independent of the parameters, show that the statistics

    $t_j = \sum_{i=1}^{n} B_j(x_i), \quad j = 1, 2, \ldots, k,$

are a set of $k$ jointly sufficient statistics for the $k$ parameters $\theta_1, \ldots, \theta_k$, and that their
joint frequency function is

    $g(t_1, t_2, \ldots, t_k \mid \theta) = \exp\{n D(\theta)\}\; h(t_1, t_2, \ldots, t_k) \exp\left\{\sum_{j=1}^{k} A_j(\theta)\, t_j\right\},$

which is itself of the form (17.86).


17.15 Use the result of Exercise 17.14 to derive the distribution of $\bar{x}$ in Example 17.6.


17.16 Show that $\frac{1}{n+1}\sum (x-\bar{x})^2$ is the multiple of the sample variance with minimum
mean-square-error in estimating the variance of a normal population.


17.17 Use the method of 17.10 to correct the bias of the sample variance in estimating 
the population variance. (Cf. the result of Example 17.3.) 
(Quenouille, 1956) 


17.18 Show that if the method of 17.10 is used to correct for bias, the variance of
$t_n'$ is asymptotically the same as that of $t_n$ to order $1/n$.
(Quenouille, 1956)


17.19 Use the method of 17.10 to correct the bias in using the square of the sample 
mean to estimate the square of the population mean. 


17.20 For a (positive definite) matrix of variances and covariances, the product of
any diagonal element with the corresponding diagonal element of the reciprocal matrix
cannot be less than unity. Hence show that if, in (17.87), we have $r = k$ and $\tau_i = \theta_i$
(all $i$), the resulting bound for an estimator of $\theta_i$ is not less than the bound given by (17.24).
Give a reason for this result.
(C. R. Rao, 1952)


17.21 The MVB (17.22) holds good for a distribution $f(x \mid \theta)$ whose range $(a, b)$
depends on $\theta$, provided that (17.18) remains true. Show that this is so if

    $f(a \mid \theta) = f(b \mid \theta) = 0,$

and that if in addition

    $\left[\frac{\partial f(x \mid \theta)}{\partial \theta}\right]_{x=a} = \left[\frac{\partial f(x \mid \theta)}{\partial \theta}\right]_{x=b} = 0,$

(17.19) also remains true and we may write the MVB in the form (17.23).


17.22 Apply the result of Exercise 17.21 to show that the MVB holds for the
estimation of $\theta$ in

    $dF(x) = \frac{1}{\Gamma(p)} (x-\theta)^{p-1} \exp\{-(x-\theta)\}\,dx, \quad \theta \leq x < \infty;\ p > 2,$

and is equal to $(p-2)/n$, but is not attainable since there is no single sufficient statistic
for $\theta$.


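17.23 $x$ is a random variable in the range $(a, b)$ (which may depend on $\theta$) whose
distribution is $f(x \mid \theta)$. If $E\{t(x, \theta)\} = \tau(\theta)$ and $f$ and $\partial f/\partial x$ vanish at $a$ and at $b$, show that

    $\operatorname{var} t \geq \left\{E\left(\frac{\partial t}{\partial x}\right)\right\}^2 \Big/ E\left\{\left(\frac{\partial \log f}{\partial x}\right)^2\right\}.$

(Cf. Exercise 17.21 to establish (17.22).)
(B. R. Rao (1958b). The analogue of (17.45) also holds, $\tau''$ being
replaced by $E\left(\frac{\partial^2 t}{\partial x^2}\right)$ and $\frac{\partial}{\partial \theta}$ by $\frac{\partial}{\partial x}$ in $J_{rp}$; see Sankaran (1964).)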
17.24 For a distribution of form

    $f(x) = g(x)/h(\theta), \quad \theta \leq x \leq b,$

show that if a function $t(x)$ of a single observation is an unbiassed estimator of $\tau(\theta)$, then

    $-t(x) = \{\tau(x)\, h'(x) + \tau'(x)\, h(x)\}/g(x).$

Hence show, using 17.35, that the unique MV unbiassed estimator of $\tau(\theta)$ in samples
of size $n$ is

    $p(x_{(1)}) = \tau(x_{(1)}) - \frac{\tau'(x_{(1)})\, h(x_{(1)})}{n\, g(x_{(1)})},$

and similarly if $a \leq x \leq \theta$ that it is

    $p(x_{(n)}) = \tau(x_{(n)}) + \frac{\tau'(x_{(n)})\, h(x_{(n)})}{n\, g(x_{(n)})}.$

Illustrate on

    $f(x) = 1/\theta, \quad 0 \leq x \leq \theta,$

and on

    $f(x) = \exp\{-(x-\theta)\}, \quad \theta \leq x < \infty.$

(Tate, 1959)


17.25 If the pair of statistics $(t_1, t_2)$ is jointly sufficient for two parameters $(\theta_1, \theta_2)$, and $t_1$
is sufficient for $\theta_1$ when $\theta_2$ is known, show that the conditional distribution of $t_2$, given $t_1$,
is independent of $\theta_1$. As an illustration, consider $k$ independent binomial distributions
with sample sizes $n_i$ $(i = 1, 2, \ldots, k)$ and parameters $\theta_i$ connected by the relation

    $\log\left(\frac{\theta_i}{1-\theta_i}\right) = \alpha + \beta x_i,$

and show that if $y_i$ is the number of "successes" in the $i$th sample, the conditional
distribution of $\sum_i x_i y_i$, given $\sum_i y_i$, is independent of $\alpha$.
(D. R. Cox, 1958a)


17.26 If the zero frequency of a Poisson distribution cannot be observed, it is called
a truncated Poisson distribution. Show that from a single observation $x$ $(x = 1, 2, \ldots)$,
on a truncated Poisson $e^{-\theta}\theta^{x}/x!$, the only unbiassed estimator of $1 - e^{-\theta}$ takes the values
0 when $x$ is odd, 2 when $x$ is even.


17.27 As in Example 17.13, show that the efficiency of the estimate of $\sigma$ based on
the mean difference discussed in 10.14 is 0.978 in normal samples.


CHAPTER 18
ESTIMATION : MAXIMUM LIKELIHOOD 


18.1 We have already (8.6-10) encountered the Maximum Likelihood (abbreviated
ML) principle in its general form. In this chapter we shall be concerned with
its application to the problems of estimation, and its properties when used as a method
of estimation. We shall confine our discussion for the most part to the case of samples
of $n$ independent observations from the same distribution. The joint probability of
the observations, regarded as a function of a single unknown parameter $\theta$, is called
the Likelihood Function (abbreviated LF) of the sample, and is written

    $L(x \mid \theta) = f(x_1 \mid \theta)\, f(x_2 \mid \theta) \cdots f(x_n \mid \theta),$    (18.1)

where we write $f(x \mid \theta)$ indifferently for a univariate or multivariate, continuous or
discrete distribution.

The ML principle, whose extensive use in statistical theory dates from the work
of Fisher (1921a), directs us to take as our estimator of $\theta$ that value (say, $\hat\theta$) within the
admissible range of $\theta$ which makes the LF as large as possible. That is, we choose $\hat\theta$
so that for any admissible value $\theta$

    $L(x \mid \hat\theta) \geq L(x \mid \theta).$    (18.2)

We assume that $\theta$ may take any real value in an interval (which may be infinite in either
or both directions).


18.2 The determination of the form of the ML estimator becomes relatively
simple in one general situation. If the LF is a twice-differentiable function of $\theta$ throughout
its range, stationary values of the LF within the admissible range of $\theta$ will, if
they exist, be given by roots of

    $L'(x \mid \theta) = \frac{\partial L(x \mid \theta)}{\partial \theta} = 0.$    (18.3)

A sufficient (though not a necessary) condition that any of these stationary values
(say, $\hat\theta$) be a local maximum is that

    $L''(x \mid \hat\theta) < 0.$    (18.4)

If we find all the local maxima of the LF in this way (and, if there are more than
one, choose the largest of them) we shall have found the solution(s) of (18.2), provided
that there is no terminal maximum of the LF at the extreme permissible values of $\theta$.


18.3 In practice, it is often simpler to work with the logarithm of the LF than
with the function itself. Under the conditions of the last section, they will have
maxima together, since

    $\frac{\partial}{\partial \theta} \log L = L'/L$

and $L > 0$. We therefore seek solutions of

    $(\log L)' = 0$    (18.5)

for which

    $(\log L)'' < 0,$    (18.6)

if these are simpler to solve than (18.3) and (18.4). (18.5) is often called the likelihood
equation.


Maximum Likelihood and sufficiency 
18.4 If a single sufficient statistic exists for $\theta$, we see at once that the ML estimator
of $\theta$ must be a function of it. For sufficiency of $t$ for $\theta$ implies the factorization of
the LF (17.84). That is,

    $L(x \mid \theta) = g(t \mid \theta)\, h(x),$    (18.7)

the second factor on the right of (18.7) being independent of $\theta$. Thus choice of $\hat\theta$
to maximize $L(x \mid \theta)$ is equivalent to choosing $\hat\theta$ to maximize $g(t \mid \theta)$, and hence $\hat\theta$ will
be a function of $t$ alone.


18.5 If a MVB estimator $t$ exists for $\tau(\theta)$, and the likelihood equation (18.5) has
a solution $\hat\theta$, then $t = \tau(\hat\theta)$ and the solution $\hat\theta$ is unique, occurring at a maximum of
the LF. For we have seen (17.33) that, when there is a single sufficient statistic, the
LF is of the form in which MVB estimation of some function of $\theta$ is possible. Thus,
as at (17.27), the LF is of the form

    $(\log L)' = A(\theta)\{t - \tau(\theta)\},$    (18.8)

so that the solutions of (18.5) are of form

    $t = \tau(\hat\theta).$    (18.9)

Differentiating (18.8) again, we have

    $(\log L)'' = A'(\theta)\{t - \tau(\theta)\} - A(\theta)\,\tau'(\theta).$    (18.10)

But since, from (17.29),

    $\tau'(\theta)/A(\theta) = \operatorname{var} t,$

the last term in (18.10) may be written

    $-A(\theta)\,\tau'(\theta) = -\{A(\theta)\}^2 \operatorname{var} t.$    (18.11)

Moreover, at $\hat\theta$ the first term on the right of (18.10) is zero in virtue of (18.9). Hence
(18.10) becomes, on using (18.11),

    $(\log L)''_{\hat\theta} = -\{A(\hat\theta)\}^2 \operatorname{var} t < 0.$    (18.12)

By (18.12), every solution of (18.5) is a maximum of the LF. But under regularity
conditions there must be a minimum between successive maxima. Since there is no
minimum, it follows that there cannot be more than one maximum. This is otherwise
obvious from the uniqueness of the MVB estimator $t$.

(18.9) shows that where a MVB (unbiassed) estimator exists, it is given by the ML
method.

18.6 The uniqueness of the ML estimator where a single sufficient statistic exists
extends to the case where the range of $f(x \mid \theta)$ depends upon $\theta$, but the argument is
somewhat different in this case. We have seen (17.40-1) that a single sufficient statistic
can only exist if

    $f(x \mid \theta) = g(x)/h(\theta).$    (18.13)

The LF is thus also of form

    $L(x \mid \theta) = \prod_{i=1}^{n} g(x_i) \big/ \{h(\theta)\}^n$    (18.14)

and (18.14) is as large as possible if $h(\theta)$ is as small as possible. Now from (18.13)

    $1 = \int f(x \mid \theta)\,dx = \int g(x)\,dx \big/ h(\theta),$

where integration is over the whole range of $x$. Hence

    $h(\theta) = \int g(x)\,dx.$    (18.15)

From (18.15) it follows that to make $h(\theta)$ as small as possible, we must choose $\hat\theta$ so that
the value of the integral on the right (one or both of whose limits of integration depend
on $\theta$) is minimized.

Now a single sufficient statistic for $\theta$ exists (17.40-1) only if one terminal of the
range is independent of $\theta$ or if the upper terminal is a monotone decreasing function
of the lower terminal. In either of these situations, the value of (18.15) is a monotone
function of the range of integration on the right-hand side, reaching a unique terminal
minimum when that range is as small as is possible, consistent with the observations.
The ML estimator $\hat\theta$ obtained by minimizing this range is thus unique, and the LF
(18.14) has a terminal maximum at $L(x \mid \hat\theta)$.

The results of this and the previous section were originally obtained by Huzurbazar
(1948), who used a different method in the "regular" case of 18.5.


18.7 Thus we have seen that where a single sufficient statistic $t$ exists for a parameter
$\theta$, the ML estimator $\hat\theta$ of $\theta$ is a function of $t$ alone. Further, $\hat\theta$ is unique, the
LF having a single maximum in this case. The maximum is a stationary value (under
regularity conditions) or a terminal maximum according to whether the range is independent
of, or dependent upon, $\theta$.


18.8 It follows from our results that all the optimum properties of single sufficient 
statistics are conferred upon ML estimators which are one-to-one functions of them. 
For example, we need only obtain the solution of the likelihood equation, and find the 
function of it which is unbiassed for the parameter. It then follows from the results 
of 17.35 that this will be the unique MV estimator of the parameter, attaining the 
MVB (17.22) if this is possible. 

The sufficient statistics derived in Examples 17.8, 17.9, 17.10, 17.16, 17.18 and
17.19 are all easily obtained by the ML method.


Example 18.1
To estimate $\theta$ in

    $dF(x) = dx/\theta, \quad 0 \leq x \leq \theta,$

we see at once from the LF in Example 17.16 that $\hat\theta = x_{(n)}$, the sufficient statistic, the
LF having a sharp (non-differentiable) maximum there.

Obviously, $\hat\theta$ is not an unbiassed estimator of $\theta$. A modified unbiassed estimator is
easily seen to be

    $(n+1)\, x_{(n)}/n.$
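
The bias of the largest observation and its removal can be illustrated by simulation. This is a rough Monte Carlo sketch, not a derivation: the parameter value, sample size, number of replications and seed are arbitrary assumptions, and `simulate` is an illustrative name. It relies on the standard result that the largest of $n$ rectangular observations on $(0, \theta)$ has mean $n\theta/(n+1)$.

```python
import random

def simulate(theta, n, reps, seed=1):
    """Monte Carlo means of x_(n) and of (n+1)x_(n)/n for the rectangular law on (0, theta)."""
    rng = random.Random(seed)
    total_ml = 0.0
    for _ in range(reps):
        total_ml += max(rng.uniform(0.0, theta) for _ in range(n))
    mean_ml = total_ml / reps
    return mean_ml, (n + 1) * mean_ml / n

theta, n = 2.0, 5                     # assumed true value and sample size
mean_ml, mean_adjusted = simulate(theta, n, reps=20000)
print("mean of ML estimator      :", mean_ml)        # theory: n*theta/(n+1)
print("mean of adjusted estimator:", mean_adjusted)  # theory: theta
```

The first average settles below the true value, near $n\theta/(n+1)$, while the adjusted estimator averages out close to the true value itself.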


Example 18.2 
To estimate the mean $\theta$ of a normal distribution with known variance $\sigma^2$. We have
seen (Example 17.6) that

    $(\log L)' = \frac{n}{\sigma^2}(\bar{x} - \theta).$

We obtain the ML estimator by equating this to zero, and find

    $\hat\theta = \bar{x}.$

In this case, $\hat\theta$ is unbiassed for $\theta$.
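
The following sketch checks this example numerically and ties it to the curvature result of 18.5. The data and the value taken for the known variance are made up, and the function names are illustrative. For this distribution $A(\theta) = n/\sigma^2$ and the variance of the sample mean is $\sigma^2/n$, so the second derivative of $\log L$ at the ML estimator should equal $-n/\sigma^2$.

```python
import math

def log_lik(xs, theta, sigma2):
    """Log LF for a normal sample with mean theta and known variance sigma2."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - theta) ** 2 for x in xs) / (2 * sigma2))

def score(xs, theta, sigma2):
    """(log L)' = n (xbar - theta) / sigma2."""
    n = len(xs)
    return n * (sum(xs) / n - theta) / sigma2

xs = [4.1, 5.3, 3.8, 6.0, 4.9, 5.5]   # made-up observations
sigma2 = 1.5                           # assumed known variance
theta_hat = sum(xs) / len(xs)          # ML estimator: the sample mean

# Central second difference approximates (log L)'' at theta_hat.
h = 1e-3
curvature = (log_lik(xs, theta_hat + h, sigma2)
             - 2 * log_lik(xs, theta_hat, sigma2)
             + log_lik(xs, theta_hat - h, sigma2)) / h ** 2

print(theta_hat, score(xs, theta_hat, sigma2), curvature, -len(xs) / sigma2)
```

The score vanishes at the sample mean, and the numerical curvature agrees with $-n/\sigma^2$, which is negative, so the stationary value is a maximum.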


The general case 

18.9 If no single sufficient statistic for $\theta$ exists, the LF no longer necessarily has
a unique maximum value (cf. Exercises 18.17, 18.33), and we choose the ML estimator
to satisfy (18.2). We now have to consider the properties of the estimators obtained
by this method. We shall see that, under very broad conditions, the ML estimator
is consistent; and that under regularity conditions, the most important of which is that
the range of $f(x \mid \theta)$ does not depend on $\theta$, the ML estimator is asymptotically normally
distributed and is an efficient estimator. These, however, are large-sample properties
and, important as they are, it should be borne in mind that they are not such powerful
recommendations of the ML method as the properties, inherited from sufficient statistics,
which we have discussed in sections 18.4 onwards. Perhaps it would be unreasonable
to expect any method of estimation to produce "best" results under all circumstances
and for all sample sizes. However that may be, the fact remains that, outside the field
of sufficient statistics, the optimum properties of ML estimators are asymptotic ones.


Example 18.3 
As an example of the general situation, consider the estimation of the correlation
parameter $\rho$ in samples of $n$ from the standardized bivariate normal distribution

    $dF = \frac{1}{2\pi (1-\rho^2)^{1/2}} \exp\left\{-\frac{1}{2(1-\rho^2)}(x^2 - 2\rho xy + y^2)\right\} dx\,dy,$
    $-\infty < x, y < \infty; \quad |\rho| < 1.$

We find

    $\log L = -n \log(2\pi) - \tfrac{1}{2} n \log(1-\rho^2) - \frac{1}{2(1-\rho^2)}\left(\sum x^2 - 2\rho \sum xy + \sum y^2\right),$

whence, for $\frac{\partial \log L}{\partial \rho} = 0$ we have

    $\frac{n\rho}{1-\rho^2} - \frac{1}{(1-\rho^2)^2}\left\{\rho\left(\sum x^2 + \sum y^2\right) - (1+\rho^2)\sum xy\right\} = 0,$

reducing to the cubic equation

    $\rho(1-\rho^2) + (1+\rho^2)\frac{1}{n}\sum xy - \rho\left(\frac{1}{n}\sum x^2 + \frac{1}{n}\sum y^2\right) = 0.$

This has three roots, two of which may be complex. If all three are real, and yield
values of $\rho$ in the admissible range, then in accordance with (18.2) we choose as the
ML estimator that which corresponds to the largest value of the LF.

If we express the cubic equation in the form

    $z^3 + pz + q = 0$

the condition that there shall be only one real root is that

    $4p^3 + 27q^2 > 0$

and is certainly fulfilled when $p > 0$, where

    $p = \frac{1}{n}\left(\sum x^2 + \sum y^2\right) - \frac{1}{3}\left(\frac{1}{n}\sum xy\right)^2 - 1.$    (18.16)

Since, by the results of 10.3 and 10.9, the sample moments in (18.16) are consistent
estimators of the corresponding population moments, we see from (18.16) that $p$ converges
in probability to

    $(1 + 1 - \tfrac{1}{3}\rho^2 - 1) = 1 - \tfrac{1}{3}\rho^2 > 0.$

Thus, in large samples, there will tend to be only one real root of the likelihood equation,
and it is this root which will be the ML estimator, the complex roots being inadmissible
values.
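
The selection rule of this example (find the real roots of the cubic in the admissible range and keep the one with the largest LF) can be sketched as follows. This is one possible numerical approach, not the method of the text: the roots are located by a sign-change scan followed by bisection, the paired data are made up, and the names `cubic`, `log_lik` and `ml_rho` are illustrative. A root lying within one grid step of the ends of the admissible range would be missed by this simple scan.

```python
import math

def log_lik(rho, n, sxx, syy, sxy):
    """Log LF of the standardized bivariate normal as a function of rho."""
    q = sxx - 2.0 * rho * sxy + syy
    return (-n * math.log(2 * math.pi) - 0.5 * n * math.log(1 - rho ** 2)
            - q / (2 * (1 - rho ** 2)))

def cubic(rho, n, sxx, syy, sxy):
    """Left-hand side of the cubic likelihood equation."""
    return rho * (1 - rho ** 2) + (1 + rho ** 2) * sxy / n - rho * (sxx + syy) / n

def ml_rho(xs, ys, grid=2000, tol=1e-12):
    """Return (ML estimate, list of real roots found in (-1, 1))."""
    n = len(xs)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    f = lambda r: cubic(r, n, sxx, syy, sxy)
    pts = [-1 + 2 * i / grid for i in range(1, grid)]   # interior grid points
    roots = []
    for lo, hi in zip(pts, pts[1:]):
        if f(lo) == 0.0:
            roots.append(lo)
        elif f(lo) * f(hi) < 0:
            while hi - lo > tol:                        # bisection on a sign change
                mid = 0.5 * (lo + hi)
                if f(lo) * f(mid) <= 0:
                    hi = mid
                else:
                    lo = mid
            roots.append(0.5 * (lo + hi))
    best = max(roots, key=lambda r: log_lik(r, n, sxx, syy, sxy))
    return best, roots

# Made-up paired observations with clear positive association.
xs = [0.5, -1.2, 0.3, 1.8, -0.7, 0.9, -1.5, 0.2]
ys = [0.7, -0.9, -0.1, 1.5, -1.0, 0.4, -1.1, 0.6]
rho_hat, roots = ml_rho(xs, ys)
print(rho_hat, roots)
```

For these data the scan finds a single admissible real root, which is therefore the ML estimate; with several admissible roots the `max` over the log LF applies the rule (18.2).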


The consistency of Maximum Likelihood estimators 

18.10 We now show that, under very general conditions, ML estimators are
consistent.

As at (18.2), we consider the case of $n$ independent observations from a distribution
$f(x \mid \theta)$, and for each $n$ we choose the ML estimator $\hat\theta$ so that, if $\theta$ is any admissible
value of the parameter, we have(*)

    $\log L(x \mid \hat\theta) \geq \log L(x \mid \theta).$    (18.17)

We denote the true value of $\theta$ by $\theta_0$, and let $E_0$ represent the operation of taking
expectations when the true value $\theta_0$ holds. Consider the random variable $L(x \mid \theta)/L(x \mid \theta_0)$.
In virtue of the fact that the geometric mean of a non-degenerate distribution cannot
exceed its arithmetic mean, we have, for all $\theta^* \neq \theta_0$,

    $E_0\left\{\log \frac{L(x \mid \theta^*)}{L(x \mid \theta_0)}\right\} < \log E_0\left\{\frac{L(x \mid \theta^*)}{L(x \mid \theta_0)}\right\}.$    (18.18)

Now the expectation on the right-hand side of (18.18) is

    $\int \cdots \int \frac{L(x \mid \theta^*)}{L(x \mid \theta_0)}\, L(x \mid \theta_0)\,dx_1 \cdots dx_n = 1.$

Thus (18.18) becomes

    $E_0\left\{\log \frac{L(x \mid \theta^*)}{L(x \mid \theta_0)}\right\} < 0$

or, inserting a factor $1/n$,

    $E_0\left\{\frac{1}{n}\log L(x \mid \theta^*)\right\} < E_0\left\{\frac{1}{n}\log L(x \mid \theta_0)\right\}$    (18.19)

provided that the expectation on the right exists, as it does very generally.
Now for any value of $\theta$

    $\frac{1}{n}\log L(x \mid \theta) = \frac{1}{n}\sum_{i=1}^{n} \log f(x_i \mid \theta)$

is the mean of a set of $n$ independent identical random variables with expectation

    $E_0\{\log f(x \mid \theta)\} = E_0\left\{\frac{1}{n}\log L(x \mid \theta)\right\}.$

By the Strong Law of Large Numbers (7.25), therefore, $\frac{1}{n}\log L(x \mid \theta)$ converges with
probability unity to its expectation, as $n$ increases. Thus as $n \to \infty$ we have, from
(18.19), with probability unity

    $\frac{1}{n}\log L(x \mid \theta^*) < \frac{1}{n}\log L(x \mid \theta_0)$

or

    $\lim_{n \to \infty} \operatorname{prob}\{\log L(x \mid \theta^*) < \log L(x \mid \theta_0)\} = 1, \quad \theta^* \neq \theta_0.$    (18.20)

On the other hand, (18.17) with $\theta = \theta_0$ gives

    $\log L(x \mid \hat\theta) \geq \log L(x \mid \theta_0).$    (18.21)

(18.20) and (18.21) imply that, as $n \to \infty$, $L(x \mid \hat\theta)$ cannot take any other value than
$L(x \mid \theta_0)$. If $L(x \mid \theta)$ is a one-to-one function of $\theta$, this implies that

    $\operatorname{prob}\left\{\lim_{n \to \infty} \hat\theta = \theta_0\right\} = 1.$    (18.22)

This is a heuristic form of Wald's (1949) rigorous proof of the consistency of ML
estimators, which requires further conditions.

(*) Because of the equality sign in (18.17), the sequence of values of $\hat\theta$ may be determinable in
more than one way. See 18.11 and 18.13 below.
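
The key inequality of this argument, that the average log-likelihood at any wrong value falls below its value at the true parameter once $n$ is large, can be watched in a simulation. The sketch below is an illustration under stated assumptions rather than part of the proof: it takes a normal distribution with unit variance, an arbitrary true mean of 1, a fixed seed and a sample of 5000. For this particular distribution the expected gap works out to $(\theta^* - \theta_0)^2/2$.

```python
import math
import random

def mean_log_lik(xs, theta):
    """(1/n) log L(x | theta) for a normal law with mean theta and unit variance."""
    n = len(xs)
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - theta) ** 2 for x in xs) / n

rng = random.Random(42)      # fixed seed so the run is repeatable
theta0 = 1.0                 # assumed true value of the parameter
xs = [rng.gauss(theta0, 1.0) for _ in range(5000)]

gaps = {}
for theta_star in [0.0, 0.5, 1.5, 2.0]:
    gaps[theta_star] = mean_log_lik(xs, theta0) - mean_log_lik(xs, theta_star)
    print(theta_star, gaps[theta_star])   # expected gap: (theta_star - theta0)**2 / 2
```

Each gap is positive and grows with the distance of the wrong value from the true one, which is the behaviour that forces the maximizing value toward the true parameter.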


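18.11 We have shown that any sequence of estimators $\hat\theta$ obtained by use of (18.2)
is consistent. This result is strengthened by the fact that Huzurbazar (1948) has shown
under regularity conditions that ultimately, as $n$ increases, there is a unique consistent
ML estimator.

Suppose that the LF possesses two derivatives. It follows from the convergence
in probability of $\hat\theta$ to $\theta_0$ that, in probability,

    $\frac{1}{n}\left[\frac{\partial^2}{\partial\theta^2}\log L(x \mid \theta)\right]_{\theta=\hat\theta} \to \frac{1}{n}\left[\frac{\partial^2}{\partial\theta^2}\log L(x \mid \theta)\right]_{\theta=\theta_0}.$    (18.23)

Now by the Strong Law of Large Numbers, once more,

    $\frac{1}{n}\frac{\partial^2}{\partial\theta^2}\log L(x \mid \theta) = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2}{\partial\theta^2}\log f(x_i \mid \theta)$

is the mean of $n$ independent identical variates and converges with probability unity
to its mean value. Thus we may write (18.23) as

    $\lim_{n \to \infty} \operatorname{prob}\left\{\frac{1}{n}\left[\frac{\partial^2}{\partial\theta^2}\log L(x \mid \theta)\right]_{\theta=\hat\theta} = E_0\left\{\frac{1}{n}\left[\frac{\partial^2}{\partial\theta^2}\log L(x \mid \theta)\right]_{\theta=\theta_0}\right\}\right\} = 1.$    (18.24)

But we have seen at (17.19) that under regularity conditions

    $E\left[\frac{\partial^2}{\partial\theta^2}\log L(x \mid \theta)\right] = -E\left\{\left(\frac{\partial \log L(x \mid \theta)}{\partial\theta}\right)^2\right\} < 0.$    (18.25)

Thus (18.24) becomes

    $\lim_{n \to \infty} \operatorname{prob}\left\{\left[\frac{\partial^2}{\partial\theta^2}\log L(x \mid \theta)\right]_{\theta=\hat\theta} < 0\right\} = 1.$    (18.26)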


18.12 Now suppose that the conditions of 18.2 hold, and that two local maxima
of the LF, at $\hat\theta_1$ and $\hat\theta_2$, are roots of (18.5) satisfying (18.6). If $\log L(x \mid \theta)$ has a second
derivative everywhere, as we have assumed in the last section, there must be a minimum
between the maxima at $\hat\theta_1$ and $\hat\theta_2$. If this is at $\hat\theta_3$, we must have

    $\left[\frac{\partial^2 \log L(x \mid \theta)}{\partial\theta^2}\right]_{\theta=\hat\theta_3} \geq 0.$    (18.27)

But since $\hat\theta_1$ and $\hat\theta_2$ are consistent estimators, $\hat\theta_3$, which lies between them in value,
must also be consistent and must satisfy (18.26). Since (18.26) and (18.27) directly
contradict each other, it follows that we can only have one consistent estimator $\hat\theta$
obtained as a root of the likelihood equation (18.5).


18.13 A point which should be discussed in connexion with the consistency of
ML estimators is that, for particular samples, there is the possibility that the LF has
two (or more) equal suprema, i.e. that the equality sign holds in (18.2). How can we
choose between the values $\hat\theta_1$, $\hat\theta_2$, etc., at which they occur? There seems to be an
essential indeterminacy here. Fortunately, however, it is not an important one, since
the difficulty in general only arises when particular configurations of sample values
are realized which have small probability of occurrence. However, if the parameter
itself is essentially indeterminable, the difficulty can arise in all samples, as the following
example makes clear.


Example 18.4
In Example 18.3 put

    $\cos\theta = \rho.$

To each real solution of the cubic likelihood equation, say $\hat\rho$, there will now correspond
an infinity of estimators of $\theta$, of form

    $\hat\theta_r = \arccos\hat\rho + 2r\pi$

where $r$ is any integer. The parameter $\theta$ is essentially incapable of estimation. Considered
as a function of $\theta$, the LF is periodic, with an infinite number of equal maxima
at $\hat\theta_r$, and the $\hat\theta_r$ differ by multiples of $2\pi$. There can be only one consistent estimator
of $\theta_0$, the true value of $\theta$, but we have no means of deciding which $\hat\theta_r$ is consistent.
In such a case, we must recognize that only $\cos\theta$ is directly estimable. We say that $\theta$
is unidentifiable.


Consistency and bias of ML estimators 

18.14 Although, under the conditions of 18.10, the ML estimator is consistent,
it is not unbiassed generally. We have already seen in Example 18.1 that there may
be bias even when the ML estimator is a function of a single sufficient statistic. In
general, we must expect bias, for if the ML estimator is $\hat\theta$ and we seek to estimate a
function $\tau(\theta)$, we have seen in 8.9 that the ML estimator of $\tau(\theta)$ is $\tau(\hat\theta)$. But in general

    $E\{\tau(\hat\theta)\} \neq \tau\{E(\hat\theta)\},$    (18.28)

so that if $\hat\theta$ is unbiassed for $\theta$, $\tau(\hat\theta)$ cannot be unbiassed for $\tau(\theta)$. If the ML estimator
is consistent, the paragraph below Example 17.3 may apply.

Brillinger (1964) shows that if ML bias is removed by the method of 17.10, $t_n'$ at
(17.10) is asymptotically normal under regularity conditions, and he gives expansions
for its bias and mean-square-error.


The efficiency and asymptotic normality of ML estimators 

18.15 When we turn to the discussion of the efficiency of ML estimators, we can- 
not obtain a result as clear-cut as that of 18.10. The following example is enough to 
show that we must make restrictions before we can obtain optimum results on efficiency. 


Example 18.5
We saw in Example 17.22 that in the distribution

$$dF(x) = dx/\theta, \qquad k\theta \leq x \leq (k+1)\theta; \quad k > 0,$$

there is no single sufficient statistic for $\theta$, but that the extreme observations $x_{(1)}$ and $x_{(n)}$
are a pair of jointly sufficient statistics for $\theta$. Let us now find the ML estimator of $\theta$.
We maximize the LF as in Example 18.1. Since

$$L(x \mid \theta) = \theta^{-n}\, u(x_{(1)} - k\theta)\, u((k+1)\theta - x_{(n)}),$$

we have

$$\hat\theta = x_{(n)}/(k+1),$$

which is accordingly the ML estimator. We see at once that $\hat\theta$ is a function of $x_{(n)}$
only, although $x_{(1)}$ and $x_{(n)}$ are both required for sufficiency.

Now by symmetry, $x_{(1)}$ and $x_{(n)}$ have the same variance, say V. The ML estimator
has variance

$$\operatorname{var}\hat\theta = V/(k+1)^2,$$

and the estimator

$$\theta^* = x_{(1)}/k$$

has variance

$$\operatorname{var}\theta^* = V/k^2.$$

Since $x_{(1)}$ and $x_{(n)}$ are asymptotically independently distributed (14.23), the function

$$\tilde\theta = a\hat\theta + (1-a)\theta^*$$

will, like $\hat\theta$ and $\theta^*$, be a consistent estimator of $\theta$, and its variance is

$$\operatorname{var}\tilde\theta = V\left\{\frac{a^2}{(k+1)^2} + \frac{(1-a)^2}{k^2}\right\},$$

which is minimized (cf. Exercise 17.21) when

$$a = \frac{(k+1)^2}{k^2 + (k+1)^2}.$$

Then

$$\operatorname{var}\tilde\theta = V/\{k^2 + (k+1)^2\}.$$

Thus, for all k > 0,

$$\frac{\operatorname{var}\tilde\theta}{\operatorname{var}\hat\theta} = \frac{(k+1)^2}{k^2 + (k+1)^2} < 1,$$

and the ML estimator has larger variance. If k is large, the variance of $\hat\theta$ is nearly
twice that of the other estimator.
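The comparison in Example 18.5 can be illustrated by a small Monte Carlo sketch. The values of k, the true parameter, the sample size, the number of replications and the seed below are arbitrary assumptions chosen for illustration; the variance ratio obtained is only a simulation estimate and, in finite samples, need not equal the limiting value exactly.

```python
import random
import statistics


def simulate(k=3.0, theta=2.0, n=50, reps=4000, seed=1):
    """Monte Carlo variances of the ML estimator and of the combined estimator."""
    rng = random.Random(seed)
    # Weight a = (k+1)^2 / {k^2 + (k+1)^2} minimizes the variance of the combination.
    a = (k + 1) ** 2 / (k ** 2 + (k + 1) ** 2)
    ml, combined = [], []
    for _ in range(reps):
        xs = [rng.uniform(k * theta, (k + 1) * theta) for _ in range(n)]
        theta_hat = max(xs) / (k + 1)      # ML estimator, uses x_(n) only
        theta_star = min(xs) / k           # rival estimator, uses x_(1) only
        ml.append(theta_hat)
        combined.append(a * theta_hat + (1 - a) * theta_star)
    return statistics.pvariance(ml), statistics.pvariance(combined)


var_ml, var_combined = simulate()
ratio = var_combined / var_ml
limit = (3 + 1) ** 2 / (3 ** 2 + (3 + 1) ** 2)   # (k+1)^2 / {k^2 + (k+1)^2} for k = 3
print(f"simulated variance ratio {ratio:.3f}, limiting value {limit:.3f}")
```

With k = 3 the limiting ratio is 16/25, so the combined estimator should show a clearly smaller variance than the ML estimator.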


18.16 We now show, following Cramér (1946), that if the first two derivatives of
the LF with respect to $\theta$ exist in an interval of $\theta$ including the true value $\theta_0$, if

$$E\left(\frac{\partial \log L}{\partial\theta}\right) = 0, \tag{18.29}$$

and

$$R^2(\theta) = -E\left(\frac{\partial^2 \log L}{\partial\theta^2}\right) = E\left\{\left(\frac{\partial \log L}{\partial\theta}\right)^2\right\} \tag{18.30}$$

exists and is non-zero for all $\theta$ in the interval, the ML estimator $\hat\theta$ is asymptotically
normally distributed with mean $\theta_0$ and variance equal to $1/R^2(\theta_0)$.
Using Taylor's theorem, we have

$$\left(\frac{\partial \log L}{\partial\theta}\right)_{\hat\theta} = \left(\frac{\partial \log L}{\partial\theta}\right)_{\theta_0} + (\hat\theta - \theta_0)\left(\frac{\partial^2 \log L}{\partial\theta^2}\right)_{\theta^*}, \tag{18.31}$$

where $\theta^*$ is some value between $\hat\theta$ and $\theta_0$. Under our regularity conditions, $\hat\theta$ is a root
of (18.5), so the left-hand side of (18.31) is zero and we may rewrite it as

$$(\hat\theta - \theta_0)\, R(\theta_0) = \frac{\left(\dfrac{\partial \log L}{\partial\theta}\right)_{\theta_0} \Big/ R(\theta_0)}{\left(\dfrac{\partial^2 \log L}{\partial\theta^2}\right)_{\theta^*} \Big/ \{-R^2(\theta_0)\}}. \tag{18.32}$$

In the denominator on the right of (18.32), we have, since $\hat\theta$ is consistent for $\theta_0$ and $\theta^*$
lies between them, from (18.24) and (18.30),

$$\lim_{n\to\infty} \operatorname{prob}\left\{\left[\frac{\partial^2 \log L}{\partial\theta^2}\right]_{\theta^*} = -R^2(\theta_0)\right\} = 1, \tag{18.33}$$

so that the denominator converges to unity. The numerator on the right of (18.32)
is the ratio to $R(\theta_0)$ of the sum of the n independent identical variates $\partial \log f(x_i \mid \theta_0)/\partial\theta$.
This sum has zero mean by (18.29) and variance defined at (18.30) to be $R^2(\theta_0)$. The
Central Limit Theorem (7.26) therefore applies, and the numerator is asymptotically
a standardized normal variate; the same is therefore true of the right-hand side as
a whole. Thus the left-hand side of (18.32) is asymptotically standard normal or, in
other words, the ML estimator $\hat\theta$ is asymptotically normally distributed with mean $\theta_0$
and variance $1/R^2(\theta_0)$.
Daniels (1961) relaxes the above condition for the asymptotic normality and efficiency
of ML estimators.


18.17 This result, which gives the ML estimator an asymptotic variance equal to
the MVB (17.24), implies that under these regularity conditions the ML estimator is
efficient. Since the MVB can only be attained in the presence of a sufficient statistic
(cf. 17.33) we are also justified in saying that the ML estimator is "asymptotically
sufficient."

Lecam (1953) has objected to the use of the term "efficient" because it implies
absolute minimization of variance in large samples, and in the strict sense this is not
achieved by the ML (or any other) estimator. For example, consider a consistent
estimator t of $\theta$, asymptotically normally distributed with variance of order $n^{-1}$. Define
a new statistic

$$t' = \begin{cases} t & \text{if } |t| \geq n^{-1/4}, \\ kt & \text{if } |t| < n^{-1/4}. \end{cases} \tag{18.34}$$

We have

$$\lim_{n\to\infty} \operatorname{var} t' / \operatorname{var} t = \begin{cases} 1 & \text{if } \theta \neq 0, \\ k^2 & \text{if } \theta = 0, \end{cases}$$

and k may be taken very small, so that at one point $t'$ is more efficient than t, and nowhere
is it worse. Lecam has shown (cf. also Bahadur (1964)) that such "superefficiency"
can arise only for a set of $\theta$-values of measure zero. In view of this, we shall retain the
term "efficiency" in its ordinary use. However, C. R. Rao (1962b) shows that even this
limited paradox can be avoided by redefining the efficiency of an estimator in terms
of its correlation with $\partial \log L/\partial\theta$ (cf. 17.15-17 and (17.61)). Walker (1963) gives sufficient
regularity conditions for the asymptotic variances of all asymptotically normal estimators
to be bounded by the MVB.


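Example 18.6
In Example 18.3 we found that the ML estimator $\hat\rho$ of the correlation parameter
in a standardized bivariate normal distribution is a root of the cubic equation

$$\frac{\partial \log L}{\partial\rho} = \frac{n\rho}{1-\rho^2} - \frac{\rho}{(1-\rho^2)^2}\left(\sum x^2 - 2\rho\sum xy + \sum y^2\right) + \frac{1}{1-\rho^2}\sum xy = 0.$$

If we differentiate again, we have

$$\frac{\partial^2 \log L}{\partial\rho^2} = \frac{n(1+\rho^2)}{(1-\rho^2)^2} - \frac{1+3\rho^2}{(1-\rho^2)^3}\left(\sum x^2 - 2\rho\sum xy + \sum y^2\right) + \frac{4\rho}{(1-\rho^2)^2}\sum xy,$$

so that, since $E(x^2) = E(y^2) = 1$ and $E(xy) = \rho$,

$$E\left(\frac{\partial^2 \log L}{\partial\rho^2}\right) = \frac{n(1+\rho^2)}{(1-\rho^2)^2} - \frac{2n(1+3\rho^2)}{(1-\rho^2)^2} + \frac{4n\rho^2}{(1-\rho^2)^2} = -\frac{n(1+\rho^2)}{(1-\rho^2)^2}.$$

Hence, from 18.16, we have asymptotically

$$\operatorname{var}\hat\rho = \frac{(1-\rho^2)^2}{n(1+\rho^2)}.$$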


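Example 18.7
The distribution

$$dF(x) = \tfrac{1}{2}\exp\{-|x-\theta|\}\,dx, \qquad -\infty \leq x \leq \infty,$$

yields the log likelihood

$$\log L(x \mid \theta) = -n\log 2 - \sum_{i=1}^{n} |x_i - \theta|.$$

This is maximized when $\sum |x_i - \theta|$ is minimized, and by the result of Exercise 2.1 this
occurs when $\theta$ is the median of the n values of x. (If n is odd, the value of the middle
observation is the median; if n is even, any value in the interval including the two
middle observations is a median.) Thus the ML estimator is $\hat\theta = \tilde x$, the sample
median. It is easily seen from (14.20) that its large-sample variance in this case is

$$\operatorname{var}\hat\theta = 1/n.$$

We cannot use the result of 18.16 to check the efficiency of $\hat\theta$, since the differentiability
conditions there imposed do not hold for this distribution. But since

$$\frac{\partial \log f(x \mid \theta)}{\partial\theta} = \begin{cases} +1 & \text{if } x > \theta, \\ -1 & \text{if } x < \theta, \end{cases}$$

only fails to exist at $x = \theta$, we have

$$\left(\frac{\partial \log f(x \mid \theta)}{\partial\theta}\right)^2 = 1, \qquad x \neq \theta,$$

so that if we interpret $E\left\{\left(\dfrac{\partial \log f(x \mid \theta)}{\partial\theta}\right)^2\right\}$ as

$$\lim_{\varepsilon\to 0}\left\{\int_{-\infty}^{\theta-\varepsilon} + \int_{\theta+\varepsilon}^{\infty}\right\}\left(\frac{\partial \log f(x \mid \theta)}{\partial\theta}\right)^2 dF(x),$$

we have

$$E\left\{\left(\frac{\partial \log L}{\partial\theta}\right)^2\right\} = nE\left\{\left(\frac{\partial \log f(x \mid \theta)}{\partial\theta}\right)^2\right\} = n,$$

so that the MVB for an estimator of $\theta$ is

$$\operatorname{var} t \geq 1/n,$$

which is attained asymptotically by $\hat\theta$.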
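The statement that the sample median maximizes the double-exponential likelihood can be checked numerically. The following Python sketch uses an arbitrary illustrative sample (an assumption, not data from the text) with n odd, and compares the log likelihood at the median with its values on a fine grid.

```python
import math
import statistics


def laplace_loglik(theta, xs):
    """log L = -n log 2 - sum |x_i - theta| for the double-exponential density."""
    return -len(xs) * math.log(2) - sum(abs(x - theta) for x in xs)


xs = [2.1, -0.4, 0.9, 3.7, 1.2, 0.3, 5.0]    # illustrative values, n odd
theta_hat = statistics.median(xs)             # ML estimator: the sample median

# No point of a fine grid over the sample range should beat the median.
grid = [i / 100 for i in range(-200, 701)]
best_on_grid = max(grid, key=lambda th: laplace_loglik(th, xs))
print(theta_hat, best_on_grid)
```

For even n the maximum is attained on the whole interval between the two middle observations, so a grid search would then return just one of many maximizing values.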


18.18 The result of 18.16 simplifies for a distribution admitting a single sufficient
statistic for the parameter. For in that case, from (18.10), (18.11) and (18.12),

$$E\left(\frac{\partial^2 \log L}{\partial\theta^2}\right) = -A(\theta)\tau'(\theta) = \left(\frac{\partial^2 \log L}{\partial\theta^2}\right)_{\hat\theta=\theta}, \tag{18.35}$$

so that there is no need to evaluate the expectation in this case: the MVB becomes
simply $-1\Big/\left(\dfrac{\partial^2 \log L}{\partial\theta^2}\right)_{\hat\theta=\theta}$, is attained exactly when $\hat\theta$ is unbiassed for $\theta$, and
asymptotically in any case under the conditions of 18.16.

If there is no single sufficient statistic, the asymptotic variance of $\hat\theta$ may be estimated
in the usual way from the sample, an unbiassed estimator commonly being
sought.


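Example 18.8
To estimate the standard deviation $\sigma$ of a normal distribution

$$dF(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{x^2}{2\sigma^2}\right)dx, \qquad -\infty \leq x \leq \infty.$$

We have, omitting the constant term,

$$\log L(x \mid \sigma) = -n\log\sigma - \frac{\sum x^2}{2\sigma^2},$$

$$(\log L)' = -n/\sigma + \sum x^2/\sigma^3,$$

so that the sufficient ML estimator is $\hat\sigma = \sqrt{\left(\tfrac{1}{n}\sum x^2\right)}$ and

$$(\log L)'' = n/\sigma^2 - 3\sum x^2/\sigma^4 = \frac{n}{\sigma^2}\left(1 - \frac{3\hat\sigma^2}{\sigma^2}\right).$$

Thus, using (18.35), we have as n increases

$$\operatorname{var}\hat\sigma \to -1/(\log L)''_{\hat\sigma=\sigma} = \sigma^2/(2n).$$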


The cumulants of a ML estimator 

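18.19 Haldane and Smith (1956) have carried out an investigation in which, under
regularity conditions, they obtain expressions for the first four cumulants of a ML
estimator. Suppose that the distribution sampled is divided into a denumerable set
of classes, and that the probability of an observation falling into the rth class is
$\pi_r$ (r = 1, 2, ...). We thus reduce any distribution to a multinomial distribution
(5.30), and if the range of the original distribution is independent of the unknown
parameter $\theta$, we seek solutions of the likelihood equation (18.5). Since the probabilities
$\pi_r$ are functions of $\theta$, we write, from (5.78),

$$L(x \mid \theta) \propto \prod_r \pi_r^{n_r}, \tag{18.36}$$

where $n_r$ is the number of observations in the rth class and $\sum_r n_r = n$, the sample size.
(18.36) gives the likelihood equation as

$$\frac{\partial \log L}{\partial\theta} = \sum_r n_r \frac{\pi_r'}{\pi_r} = 0, \tag{18.37}$$

where a prime denotes differentiation with respect to $\theta$. Now, using Taylor's theorem,
we expand $\pi_r$ and $\pi_r'$ about the true value $\theta_0$, and obtain

$$\begin{aligned}
\pi_r(\hat\theta) &= \pi_r(\theta_0) + (\hat\theta-\theta_0)\pi_r'(\theta_0) + \tfrac{1}{2}(\hat\theta-\theta_0)^2\pi_r''(\theta_0) + \dots, \\
\pi_r'(\hat\theta) &= \pi_r'(\theta_0) + (\hat\theta-\theta_0)\pi_r''(\theta_0) + \tfrac{1}{2}(\hat\theta-\theta_0)^2\pi_r'''(\theta_0) + \dots
\end{aligned} \tag{18.38}$$

If we insert (18.38) into (18.37), expand binomially, and sum the series, we have,
writing

$$\begin{aligned}
A_i &= \sum_r \{\pi_r'(\theta_0)\}^{i+1} / \{\pi_r(\theta_0)\}^i, \\
B_i &= \sum_r \{\pi_r'(\theta_0)\}^{i}\,\pi_r''(\theta_0) / \{\pi_r(\theta_0)\}^i, \\
C_i &= \sum_r \{\pi_r'(\theta_0)\}^{i-1}\{\pi_r''(\theta_0)\}^2 / \{\pi_r(\theta_0)\}^i, \\
D_i &= \sum_r \{\pi_r'(\theta_0)\}^{i}\,\pi_r'''(\theta_0) / \{\pi_r(\theta_0)\}^i, \\
\alpha_i &= \sum_r \{\pi_r'(\theta_0)\}^{i}\left\{\frac{n_r}{n} - \pi_r(\theta_0)\right\} \Big/ \{\pi_r(\theta_0)\}^i, \\
\beta_i &= \sum_r \{\pi_r'(\theta_0)\}^{i-1}\pi_r''(\theta_0)\left\{\frac{n_r}{n} - \pi_r(\theta_0)\right\} \Big/ \{\pi_r(\theta_0)\}^i, \\
\delta_i &= \sum_r \{\pi_r'(\theta_0)\}^{i-1}\pi_r'''(\theta_0)\left\{\frac{n_r}{n} - \pi_r(\theta_0)\right\} \Big/ \{\pi_r(\theta_0)\}^i,
\end{aligned}$$

the expansion

$$\begin{aligned}
\alpha_1 &- (A_1 + \alpha_2 - \beta_1)(\hat\theta-\theta_0) + \tfrac{1}{2}(2A_2 - 3B_1 + 2\alpha_3 - 3\beta_2 + \delta_1)(\hat\theta-\theta_0)^2 \\
&- \tfrac{1}{6}(6A_3 - 12B_2 + 3C_1 + 4D_1)(\hat\theta-\theta_0)^3 + \dots = 0.
\end{aligned} \tag{18.39}$$

For large n, (18.39) may be inverted by Lagrange's theorem to give

$$\begin{aligned}
(\hat\theta-\theta_0) ={}& A_1^{-1}\alpha_1 + A_1^{-3}\alpha_1\left[(A_2 - \tfrac{3}{2}B_1)\alpha_1 - A_1(\alpha_2 - \beta_1)\right] \\
&+ A_1^{-5}\alpha_1\Big[\left\{2(A_2 - \tfrac{3}{2}B_1)^2 - A_1(A_3 - 2B_2 + \tfrac{1}{2}C_1 + \tfrac{2}{3}D_1)\right\}\alpha_1^2 \\
&\qquad - 3A_1(A_2 - \tfrac{3}{2}B_1)\alpha_1(\alpha_2 - \beta_1) + \tfrac{1}{2}A_1^2\alpha_1(2\alpha_3 - 3\beta_2 + \delta_1) \\
&\qquad + A_1^2(\alpha_2 - \beta_1)^2\Big] + O(n^{-2}).
\end{aligned} \tag{18.40}$$

(18.40) enables us to obtain the moments of $\hat\theta$ as series in powers of $n^{-1}$.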


18.20 Consider the sampling distribution of the sum

$$W = \sum_r h_r\left\{\frac{n_r}{n} - \pi_r(\theta_0)\right\},$$

where the $h_r$ are any constant weights. From the moments of the multinomial distribution
(cf. (5.80)), we obtain for the moments of W, writing $S_i = \sum_r h_r^i \pi_r(\theta_0)$,

$$\begin{aligned}
\mu_1(W) &= 0, \\
\mu_2(W) &= n^{-1}(S_2 - S_1^2), \\
\mu_3(W) &= n^{-2}(S_3 - 3S_1S_2 + 2S_1^3), \\
\mu_4(W) &= 3n^{-2}(S_2 - S_1^2)^2 + n^{-3}(S_4 - 4S_1S_3 - 3S_2^2 + 12S_1^2S_2 - 6S_1^4), \\
\mu_5(W) &= 10n^{-3}(S_2 - S_1^2)(S_3 - 3S_1S_2 + 2S_1^3) + O(n^{-4}), \\
\mu_6(W) &= 15n^{-3}(S_2 - S_1^2)^3 + O(n^{-4}).
\end{aligned} \tag{18.41}$$

From (18.41) we can derive the moments and product-moments of the random variables
$\alpha_i$, $\beta_i$ and $\delta_i$ appearing in (18.40), for all of these are functions of form W. Finally,
we substitute these moments into the powers of (18.40) to obtain the moments of $\hat\theta$.
Expressed as cumulants, these are

$$\begin{aligned}
\kappa_1 &= \theta_0 - \tfrac{1}{2}n^{-1}A_1^{-2}B_1 + O(n^{-2}), \\
\kappa_2 &= n^{-1}A_1^{-1} + n^{-2}A_1^{-4}\left[-A_2^2 + \tfrac{7}{2}B_1^2 + A_1(A_3 - B_2 - D_1) - A_1^3\right] + O(n^{-3}), \\
\kappa_3 &= n^{-2}A_1^{-3}(A_2 - 3B_1) + O(n^{-3}), \\
\kappa_4 &= n^{-3}A_1^{-5}\left[-12B_1(A_2 - 2B_1) + A_1(A_3 - 4D_1) - 3A_1^3\right] + O(n^{-4}),
\end{aligned} \tag{18.42}$$

whence

$$\begin{aligned}
\gamma_1 &= \kappa_3/\kappa_2^{3/2} = n^{-1/2}A_1^{-3/2}(A_2 - 3B_1) + O(n^{-3/2}), \\
\gamma_2 &= \kappa_4/\kappa_2^{2} = n^{-1}A_1^{-3}\left[-12B_1(A_2 - 2B_1) + A_1(A_3 - 4D_1) - 3A_1^3\right] + O(n^{-2}).
\end{aligned} \tag{18.43}$$

The first cumulant in (18.42) shows that the bias in $\hat\theta$ is of the order of magnitude $n^{-1}$
unless $B_1 = 0$, when it is of order $n^{-2}$, as may be confirmed by calculating a further
term in the first cumulant. The leading term in the second cumulant is simply the
asymptotic variance previously established in 18.16. (18.43) illustrates the rapidity
of the tendency to normality, established in 18.16.
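The leading terms of (18.42) are straightforward to evaluate once the class probabilities and their first three derivatives at the true value are given. The Python sketch below is an illustration only: the function name `ml_cumulants` and its argument layout are assumptions of ours, and the O() remainders are simply dropped. As a check, it is applied to the two-class case with probabilities θ and 1 − θ, where the ML estimator is the sample proportion and its cumulants are known exactly.

```python
def ml_cumulants(pi, d1, d2, d3, n):
    """Leading terms of (18.42) from class probabilities and their derivatives.

    pi, d1, d2, d3 hold pi_r and its first three derivatives at theta_0.
    Returns (bias, kappa_2, kappa_3, kappa_4) with the O() remainders dropped.
    """
    def total(f):
        return sum(f(p, a, b, c) for p, a, b, c in zip(pi, d1, d2, d3))

    A1 = total(lambda p, a, b, c: a ** 2 / p)
    A2 = total(lambda p, a, b, c: a ** 3 / p ** 2)
    A3 = total(lambda p, a, b, c: a ** 4 / p ** 3)
    B1 = total(lambda p, a, b, c: a * b / p)
    B2 = total(lambda p, a, b, c: a ** 2 * b / p ** 2)
    D1 = total(lambda p, a, b, c: a * c / p)

    bias = -0.5 * B1 / (n * A1 ** 2)
    k2 = 1 / (n * A1) + (-A2 ** 2 + 3.5 * B1 ** 2
                         + A1 * (A3 - B2 - D1) - A1 ** 3) / (n ** 2 * A1 ** 4)
    k3 = (A2 - 3 * B1) / (n ** 2 * A1 ** 3)
    k4 = (-12 * B1 * (A2 - 2 * B1) + A1 * (A3 - 4 * D1)
          - 3 * A1 ** 3) / (n ** 3 * A1 ** 5)
    return bias, k2, k3, k4


# Two classes with probabilities theta and 1 - theta (illustrative values).
theta, n = 0.3, 50
q = 1 - theta
bias, k2, k3, k4 = ml_cumulants([theta, q], [1, -1], [0, 0], [0, 0], n)
print(bias, k2, k3, k4)
```

In this binomial case the bias vanishes, the second-order term of the variance is zero, and the third and fourth cumulants reduce to those of a sample proportion, which gives a simple consistency check on the coefficients.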

If the terms in (18.42) were all evaluated, and unbiassed estimates made of each of
the first four moments of $\hat\theta$, a Pearson distribution (cf. 6.2-12) could be fitted and an
estimate of the small-sample distribution of $\hat\theta$ obtained which would provide a better
approximation than the ultimate normal approximation of 18.16.

The next higher order terms in (18.42-3) are derived by Shenton and Bowman (1963).
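The multinomial moments (18.41) on which the derivation in 18.20 rests can be checked exactly for small n by enumerating every possible sample. The Python sketch below does this with rational arithmetic; the class probabilities and weights are arbitrary illustrative assumptions, and only the moments up to the fourth, which (18.41) gives without remainder terms, are compared.

```python
from fractions import Fraction
from itertools import product


def exact_central_moments(pi, h, n):
    """Moments mu_2, mu_3, mu_4 of W = sum_r h_r (n_r/n - pi_r) by enumeration."""
    moments = [Fraction(0)] * 5
    centre = sum(hr * pr for hr, pr in zip(h, pi))
    for outcome in product(range(len(pi)), repeat=n):   # every ordered sample
        prob = Fraction(1)
        for r in outcome:
            prob *= pi[r]
        w = sum(h[r] for r in outcome) / Fraction(n) - centre
        for j in range(5):
            moments[j] += prob * w ** j
    return moments[2], moments[3], moments[4]


def formula_moments(pi, h, n):
    """The same moments from (18.41), with S_i = sum_r h_r^i pi_r."""
    s1, s2, s3, s4 = (sum(hr ** i * pr for hr, pr in zip(h, pi))
                      for i in (1, 2, 3, 4))
    m2 = s2 - s1 ** 2
    mu2 = m2 / n
    mu3 = (s3 - 3 * s1 * s2 + 2 * s1 ** 3) / n ** 2
    mu4 = (3 * m2 ** 2 / n ** 2
           + (s4 - 4 * s1 * s3 - 3 * s2 ** 2 + 12 * s1 ** 2 * s2 - 6 * s1 ** 4)
           / n ** 3)
    return mu2, mu3, mu4


pi = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]   # illustrative probabilities
h = [Fraction(1), Fraction(-2), Fraction(3)]            # illustrative weights
print(exact_central_moments(pi, h, 4) == formula_moments(pi, h, 4))
```

Exact fractions are used so that agreement is an identity rather than a floating-point coincidence; the enumeration grows as the number of classes to the power n, so it is practical only for very small samples.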


Successive approximation to ML estimators 

18.21 In most of the examples we have considered, the ML estimator has been
obtained in explicit form. The exception was in Example 18.3, where we were left
with a cubic equation to solve for the estimator, and even this can be done without
much trouble when the values of x are given. Sometimes, however, the likelihood
equation is so complicated that iterative methods must be used to find a root, starting
from some trial value t.

As at (18.31), we expand $\partial \log L/\partial\theta$ in a Taylor series, but this time about its value
at t, obtaining

$$0 = \left(\frac{\partial \log L}{\partial\theta}\right)_{\hat\theta} = \left(\frac{\partial \log L}{\partial\theta}\right)_{t} + (\hat\theta - t)\left(\frac{\partial^2 \log L}{\partial\theta^2}\right)_{\theta^*},$$

where $\theta^*$ lies between $\hat\theta$ and t. Thus

$$\hat\theta = t - \left(\frac{\partial \log L}{\partial\theta}\right)_{t} \Big/ \left(\frac{\partial^2 \log L}{\partial\theta^2}\right)_{\theta^*}. \tag{18.44}$$

If we can choose t so that it is likely to be in the neighbourhood of $\hat\theta$, we can replace
$\theta^*$ in (18.44) by t and obtain

$$\hat\theta \approx t - \left(\frac{\partial \log L}{\partial\theta}\right)_{t} \Big/ \left(\frac{\partial^2 \log L}{\partial\theta^2}\right)_{t}, \tag{18.45}$$

which will give a closer approximation to $\hat\theta$. The process can be repeated until no
further correction is achieved to the desired degree of accuracy.

The most common method for choice of t is to take it as the value of some (preferably
simply-calculated) consistent estimator of $\theta$. Then, as $n \to \infty$, we shall have
the two consistent estimators t and $\hat\theta$ converging to $\theta_0$, and $\theta^*$ consequently also doing so.
The three random variables

$$\left(\frac{\partial^2 \log L}{\partial\theta^2}\right)_{\theta^*}, \qquad \left(\frac{\partial^2 \log L}{\partial\theta^2}\right)_{t}, \qquad \text{and} \qquad \left[E\left(\frac{\partial^2 \log L}{\partial\theta^2}\right)\right]_{t}$$

will all converge to $\left[E\left(\dfrac{\partial^2 \log L}{\partial\theta^2}\right)\right]_{\theta_0}$. Use of the second of these variables, instead of the
first, in (18.44) gives (18.45) above: use of the third instead of the first gives the alternative
iterative procedure

$$\hat\theta \approx t - \left(\frac{\partial \log L}{\partial\theta}\right)_{t} \Big/ \left[E\left(\frac{\partial^2 \log L}{\partial\theta^2}\right)\right]_{t} = t + \left(\frac{\partial \log L}{\partial\theta}\right)_{t} (\operatorname{var}\hat\theta)_{t}, \tag{18.46}$$

using 18.16. (18.45) is the Newton-Raphson iterative process: (18.46) is known
as "the method of scoring for parameters," and is due to Fisher (1925). Kale (1961)
shows that (18.46) will usually be the quicker process for large n unless extremely high
accuracy is ultimately required. It is usually less laborious.

Both (18.45) and (18.46) may fail to converge in particular cases. Even when they
do converge, if the likelihood equation has multiple roots there is no guarantee that
they will converge to the root corresponding to the absolute maximum of the LF;
this should be verified by examining the changes in sign of $\partial \log L/\partial\theta$ from positive to
negative and searching the intervals in which these changes occur to locate, evaluate
and compare the maxima. V. D. Barnett (1966a) discusses a systematic method of
doing this, using the "method of false positions."


Example 18.9

To estimate the parameter $\theta$ in the Cauchy distribution

$$dF(x) = \frac{dx}{\pi\{1 + (x-\theta)^2\}}, \qquad -\infty \leq x \leq \infty.$$

The likelihood equation is

$$\frac{\partial \log L}{\partial\theta} = 2\sum_{i=1}^{n} \frac{x_i - \theta}{1 + (x_i - \theta)^2} = 0,$$

an equation of degree (2n−1) in $\theta$. From 18.16 the asymptotic variance of $\hat\theta$ is
given by

$$-\frac{1}{\operatorname{var}\hat\theta} = E\left(\frac{\partial^2 \log L}{\partial\theta^2}\right) = nE\left(\frac{\partial^2 \log f}{\partial\theta^2}\right) = \frac{n}{\pi}\int_{-\infty}^{\infty} \frac{2(x^2 - 1)}{(1 + x^2)^3}\,dx = -n/2.$$

Hence

$$\operatorname{var}\hat\theta = 2/n.$$

The equation has multiple roots in general, and for small n, (18.45) or (18.46)
may not converge (cf. V. D. Barnett (1966a)). Only negligibly often, however, is $\hat\theta$ not the
nearest maximum to the sample median, t, which has large-sample variance (Example
17.5)

$$\operatorname{var} t = \pi^2/(4n)$$

and thus has efficiency $8/\pi^2 = 0.8$ approximately. In large samples, we may therefore
use the median as our starting-point in seeking the value of $\hat\theta$, and solve (18.46), which
here becomes

$$\hat\theta = t + \frac{4}{n}\sum_i \frac{x_i - t}{1 + (x_i - t)^2}.$$

This is our first approximation to $\hat\theta$, which we may improve by further iterations of
the process.

Bloch (1966) gives an estimator with efficiency > 0.95, namely the linear combination
of 5 order-statistics $x_{(r)}$ (r = 0.13n, 0.4n, 0.5n, 0.6n, 0.87n) with weights −0.052,
0.3485, 0.407, 0.3485, −0.052. This would therefore be an excellent starting-point
for ML iteration. Rothenberg et al. (1964) show that the mean of the central 24 per
cent of a Cauchy sample has asymptotic variance 2.28/n, its efficiency therefore being
0.88. See also V. D. Barnett (1966b).
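The scoring iteration of Example 18.9, started from the sample median, may be sketched in Python as follows. The simulated sample, its size, the true location and the seed are illustrative assumptions; the stopping rule (a tolerance on the size of the correction and a cap on the number of iterations) is also our own choice rather than anything prescribed in the text.

```python
import math
import random
import statistics


def score(theta, xs):
    """d log L / d theta for a Cauchy location sample."""
    return 2 * sum((x - theta) / (1 + (x - theta) ** 2) for x in xs)


def cauchy_ml_by_scoring(xs, tol=1e-10, max_iter=200):
    """Iterate (18.46), theta <- theta + score * (2/n), from the sample median."""
    n = len(xs)
    theta = statistics.median(xs)
    for _ in range(max_iter):
        step = score(theta, xs) * 2 / n     # var(theta_hat) = 2/n asymptotically
        theta += step
        if abs(step) < tol:
            return theta
    raise RuntimeError("scoring iteration did not converge")


rng = random.Random(0)
true_theta = 5.0                            # illustrative value
xs = [true_theta + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(500)]
theta_hat = cauchy_ml_by_scoring(xs)
print(theta_hat, statistics.median(xs))
```

In line with the warning above, the iteration is not guaranteed to converge for small samples, which is why the sketch raises an error rather than returning the last value silently.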


Example 18.10

We now examine the iterative method of solution in more detail, and for this purpose
we use some data due to Fisher (1925-, Chapter 9).

Consider a multinomial distribution (cf. 5.30) with four classes, their probabilities
being

$$\begin{aligned}
p_1 &= (2+\theta)/4, \\
p_2 = p_3 &= (1-\theta)/4, \\
p_4 &= \theta/4.
\end{aligned}$$

The parameter $\theta$, which lies in the range (0, 1), is to be estimated from the observed
frequencies (a, b, c, d) falling into the classes, the sample size n being equal to
a+b+c+d. We have

$$L(a, b, c, d \mid \theta) \propto (2+\theta)^a (1-\theta)^{b+c}\,\theta^d,$$

so that

$$\frac{\partial \log L}{\partial\theta} = \frac{a}{2+\theta} - \frac{b+c}{1-\theta} + \frac{d}{\theta},$$

and if this is equated to zero, we obtain the quadratic equation in $\theta$

$$n\theta^2 + \{2(b+c) + d - a\}\theta - 2d = 0.$$

Since the product of the coefficient of $\theta^2$ and the constant term is negative, the
product of the roots of the quadratic must also be negative, and only one root can be
positive. Only this positive root falls into the permissible range for $\theta$. Its value $\hat\theta$
is given by

$$2n\hat\theta = \{a - d - 2(b+c)\} + \left[\{a + 2(b+c) + 3d\}^2 - 8a(b+c)\right]^{1/2}.$$

The ML estimator $\hat\theta$ can very simply be evaluated from this formula. For Fisher's
(genetical) example, where the observed frequencies are

    a = 1997, b = 906, c = 904, d = 32, n = 3839

the value of $\hat\theta$ is 0.0357.

It is easily verified from a further differentiation that

$$\operatorname{var}\hat\theta = -1 \Big/ E\left(\frac{\partial^2 \log L}{\partial\theta^2}\right) = \frac{2\theta(1-\theta)(2+\theta)}{n(1+2\theta)},$$

the value being about 0.000034 in this case, when $\hat\theta$ is substituted for $\theta$ in var $\hat\theta$.

For illustrative purposes, we now suppose that we wish to find $\hat\theta$ iteratively in this
case, starting from the value of an inefficient estimator. A simple inefficient estimator
which was proposed by Fisher is

$$t = \{a + d - (b+c)\}/n,$$

which is easily seen to be consistent and has variance

$$\operatorname{var} t = (1-\theta^2)/n.$$

The value of t for the genetical data is

$$t = \{1997 + 32 - (906 + 904)\}/3839 = 0.0570.$$

This is a long way from the value of $\hat\theta$, 0.0357, which we seek. Using (18.46) we
have, for our first approximation to $\hat\theta$,

$$\hat\theta_1 = 0.0570 + \left(\frac{\partial \log L}{\partial\theta}\right)_{\theta=t} (\operatorname{var}\hat\theta)_{\theta=t}.$$

Now

$$\left(\frac{\partial \log L}{\partial\theta}\right)_{\theta=t} = \frac{1997}{2.0570} - \frac{1810}{0.9430} + \frac{32}{0.0570} = -387.1713,$$

$$(\operatorname{var}\hat\theta)_{\theta=0.0570} = \frac{2 \times 0.057 \times 0.943 \times 2.057}{3839 \times 1.114} = 0.00005170678,$$

so that our improved estimator is

$$\hat\theta_1 = 0.0570 - 387.1713 \times 0.00005170678 = 0.0570 - 0.0200 = 0.0370,$$

which is in fairly close agreement with the value sought, 0.0357. A second iteration
gives

$$\left(\frac{\partial \log L}{\partial\theta}\right)_{\theta=0.0370} = \frac{1997}{2.037} - \frac{1810}{0.963} + \frac{32}{0.037} = -34.31495,$$

$$(\operatorname{var}\hat\theta)_{\theta=0.0370} = \frac{2 \times 0.037 \times 0.963 \times 2.037}{3839 \times 1.074} = 0.00003520681,$$

and hence

$$\hat\theta_2 = 0.0370 - 34.31495 \times 0.00003520681 = 0.0370 - 0.0012 = 0.0358.$$

This is very close to the value sought. At least one further iteration would be
required to bring the value to 0.0357 correct to 4 d.p., and a further iteration to confirm
that the value of $\hat\theta$ arrived at was stable to a sufficient number of decimal places to
make further iterations unnecessary. The reader should carry through these further
iterations to satisfy himself that he can use the method.

This example makes it clear that care must be taken to carry the iteration process
far enough for practical purposes. It is a somewhat unfavourable example, in that t has
an efficiency of

$$\frac{2\theta(2+\theta)}{(1+\theta)(1+2\theta)},$$

which takes the value of 0.13, or 13 per cent, when $\hat\theta$ = 0.0357 is substituted for $\theta$.
One would usually seek to start from the value of
an estimator with greater efficiency than this.
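The calculations of Example 18.10 can be reproduced with a few lines of Python. This is a sketch rather than a prescribed procedure: it evaluates the closed-form root of the quadratic and then runs the scoring iteration (18.46) from the inefficient estimator t. No rounding is applied between steps here, so the intermediate values agree with those quoted in the text only to about four decimal places, and the fixed number of iterations is an arbitrary choice.

```python
import math

a, b, c, d = 1997, 906, 904, 32
n = a + b + c + d


def score(theta):
    """d log L / d theta for the four-class multinomial."""
    return a / (2 + theta) - (b + c) / (1 - theta) + d / theta


def var_ml(theta):
    """Asymptotic variance 2 theta (1 - theta)(2 + theta) / {n (1 + 2 theta)}."""
    return 2 * theta * (1 - theta) * (2 + theta) / (n * (1 + 2 * theta))


# Closed form: the positive root of the quadratic likelihood equation.
root = (a - d - 2 * (b + c)
        + math.sqrt((a + 2 * (b + c) + 3 * d) ** 2 - 8 * a * (b + c))) / (2 * n)

# Scoring iteration (18.46) started from the inefficient estimator t.
theta = (a + d - (b + c)) / n
path = [theta]
for _ in range(20):
    theta += score(theta) * var_ml(theta)
    path.append(theta)

print(round(root, 4), [round(v, 4) for v in path[:4]])
```

Comparing the final iterate with the closed-form root gives a direct check that the iteration has been carried far enough.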


ML estimators for several parameters 

18.22 We now turn to discussion of the general case, in which more than one
parameter is to be estimated simultaneously, whether in a univariate or multivariate
distribution. If we interpret $\theta$, and possibly also x, as a vector, the formulation of
the ML principle at (18.2) holds good: we have to choose the set of admissible values
of the parameters $\theta_1, \dots, \theta_k$ which makes the LF an absolute maximum. Under
the regularity conditions of 18.2-3, the necessary condition for a local turning-point
in the LF is that

$$\frac{\partial}{\partial\theta_r}\log L(x \mid \theta_1, \dots, \theta_k) = 0, \qquad r = 1, 2, \dots, k, \tag{18.47}$$

and a sufficient condition that this be a maximum is that the matrix

$$\left(\frac{\partial^2 \log L}{\partial\theta_r\,\partial\theta_s}\right) \tag{18.48}$$

be negative definite. The k equations (18.47) are to be solved for the k ML estimators
$\hat\theta_1, \dots, \hat\theta_k$.


The case of joint sufficiency 

18.23 Just as in 18.4, we see that if there exists a set of s statistics $t_1, \dots, t_s$
which are jointly sufficient for the parameters $\theta_1, \dots, \theta_k$, the ML estimators $\hat\theta_1, \dots, \hat\theta_k$
must be functions of the sufficient statistics. As before, this follows immediately from
the factorization (cf. (17.84))

$$L(x \mid \theta_1, \dots, \theta_k) = g(t_1, \dots, t_s \mid \theta_1, \dots, \theta_k)\,h(x) \tag{18.49}$$

in virtue of the fact that h(x) does not contain $\theta_1, \dots, \theta_k$.

However, the ML estimators need not be one-to-one functions of the sufficient 
set of statistics, and are therefore not necessarily themselves a sufficient set. In Ex- 
ample 18.5, we have already met a case where the ML estimator of a single parameter 
is a function of only one of the jointly sufficient pair of statistics, there being no single 
sufficient statistic. 


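18.24 The uniqueness of the solution of the likelihood equation in the presence
of sufficiency (18.5) extends to the multiparameter case if s = k, as Huzurbazar (1949)
has pointed out. Under regularity conditions, the most general form of distribution
admitting a set of k jointly sufficient statistics (17.86) yields a LF whose logarithm is
of form

$$\log L = \sum_{j=1}^{k} A_j(\theta) \sum_{i=1}^{n} B_j(x_i) + \sum_{i=1}^{n} C(x_i) + nD(\theta), \tag{18.50}$$

where $\theta$ is written for $\theta_1, \dots, \theta_k$ and x is possibly multivariate. The likelihood equations
are therefore

$$\frac{\partial \log L}{\partial\theta_r} = \sum_j \frac{\partial A_j}{\partial\theta_r} \sum_i B_j(x_i) + n\frac{\partial D}{\partial\theta_r} = 0, \qquad r = 1, 2, \dots, k, \tag{18.51}$$

and a solution $\hat\theta = (\hat\theta_1, \hat\theta_2, \dots, \hat\theta_k)$ of (18.51) is a maximum if

$$\left(\frac{\partial^2 \log L}{\partial\theta_r\,\partial\theta_s}\right)_{\hat\theta} = \sum_j \left(\frac{\partial^2 A_j}{\partial\theta_r\,\partial\theta_s}\right)_{\hat\theta} \sum_i B_j(x_i) + n\left(\frac{\partial^2 D}{\partial\theta_r\,\partial\theta_s}\right)_{\hat\theta} \tag{18.52}$$

forms a negative definite matrix (18.48).
From (17.18) we have

$$E\left(\frac{\partial \log L}{\partial\theta_r}\right) = \sum_j \frac{\partial A_j}{\partial\theta_r} E\left(\sum_i B_j(x_i)\right) + n\frac{\partial D}{\partial\theta_r} = 0, \tag{18.53}$$

and further

$$E\left(\frac{\partial^2 \log L}{\partial\theta_r\,\partial\theta_s}\right) = \sum_j \frac{\partial^2 A_j}{\partial\theta_r\,\partial\theta_s} E\left(\sum_i B_j(x_i)\right) + n\frac{\partial^2 D}{\partial\theta_r\,\partial\theta_s}. \tag{18.54}$$

Evidently, (18.53) and (18.54) have exactly the same structural form as (18.51) and
(18.52), the difference being only that $T = \sum_i B_j(x_i)$ is replaced by its expectation and
$\hat\theta$ by the true value $\theta$. If we eliminate T from (18.52), using (18.51), and replace $\hat\theta$
by $\theta$, we shall get exactly the same result as if we eliminate E(T) from (18.54), using
(18.53). We thus have

$$\left(\frac{\partial^2 \log L}{\partial\theta_r\,\partial\theta_s}\right)_{\hat\theta=\theta} = E\left(\frac{\partial^2 \log L}{\partial\theta_r\,\partial\theta_s}\right), \tag{18.55}$$

which is the generalization of (18.35). Moreover, from (17.19),

$$E\left(\frac{\partial^2 \log L}{\partial\theta_r^2}\right) = -E\left\{\left(\frac{\partial \log L}{\partial\theta_r}\right)^2\right\}, \tag{18.56}$$

and analogously

$$E\left(\frac{\partial^2 \log L}{\partial\theta_r\,\partial\theta_s}\right) = -E\left\{\frac{\partial \log L}{\partial\theta_r}\,\frac{\partial \log L}{\partial\theta_s}\right\}, \tag{18.57}$$

and we see that the matrix

$$\left\{E\left(\frac{\partial^2 \log L}{\partial\theta_r\,\partial\theta_s}\right)\right\} = -\left\{E\left(\frac{\partial \log L}{\partial\theta_r}\,\frac{\partial \log L}{\partial\theta_s}\right)\right\} = -\left\{\operatorname{cov}\left(\frac{\partial \log L}{\partial\theta_r}, \frac{\partial \log L}{\partial\theta_s}\right)\right\} \tag{18.58}$$

is negative definite or semi-definite. For the matrix on the right-hand side of (18.58)
is, apart from sign, the dispersion matrix D of the variates $\partial \log L/\partial\theta_r$, and this is
non-negative definite, since for any variables x, the quadratic form

$$E\left\{\left[\sum_{r=1}^{k} \{x_r - E(x_r)\}u_r\right]^2\right\} = u'Du \geq 0, \tag{18.59}$$

where u is a vector of dummy variables. Thus the dispersion matrix D is non-negative
definite. If we rule out linear dependencies among the variates, D is positive definite.
In this case, the matrix on the left of (18.58) is negative definite. Thus, from (18.55),
the matrix

$$\left(\frac{\partial^2 \log L}{\partial\theta_r\,\partial\theta_s}\right)_{\hat\theta=\theta}$$

is also negative definite, and hence any solution of (18.51) is a maximum. But under
regularity conditions, there must be a minimum between any two maxima. Since
there is no minimum, there can be only one maximum. Thus, under regularity conditions,
joint sufficiency ensures that the likelihood equations have a unique solution,
and that this is at a maximum of the LF.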


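Example 18.11

We have seen in Example 17.17 that in samples from a univariate normal distribution
the sample mean and variance, $\bar x$ and $s^2$, are jointly sufficient for the population mean
and variance, $\mu$ and $\sigma^2$. It follows from 18.23 that the ML estimators must be functions
of $\bar x$ and $s^2$. We may confirm directly that $\bar x$ and $s^2$ are themselves the ML estimators.
The LF is given by

$$\log L = -\tfrac{1}{2}n\log(2\pi) - \tfrac{1}{2}n\log(\sigma^2) - \sum (x_i - \mu)^2/(2\sigma^2),$$

whence the likelihood equations are

$$\frac{\partial \log L}{\partial\mu} = \frac{\sum (x - \mu)}{\sigma^2} = \frac{n(\bar x - \mu)}{\sigma^2} = 0,$$

$$\frac{\partial \log L}{\partial(\sigma^2)} = -\frac{n}{2\sigma^2} + \frac{\sum (x - \mu)^2}{2\sigma^4} = 0.$$

The solution of these is

$$\hat\mu = \frac{1}{n}\sum x = \bar x, \qquad \hat\sigma^2 = \frac{1}{n}\sum (x - \bar x)^2 = s^2.$$

While $\hat\mu$ is unbiassed, $\hat\sigma^2$ is biassed, having expected value $(n-1)\sigma^2/n$. As in the
one-parameter case (18.14), ML estimators need not be unbiassed.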
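As a brief illustration of Example 18.11, the following Python sketch computes the two ML estimators for an arbitrary made-up sample (the data are an assumption for illustration only) and shows that the ML variance estimator, with divisor n, is exactly (n − 1)/n times the usual unbiassed estimator with divisor n − 1.

```python
import statistics

xs = [4.1, 5.3, 3.8, 6.0, 4.7, 5.1]        # illustrative sample
n = len(xs)

mu_hat = statistics.fmean(xs)               # ML estimator of the mean
sigma2_hat = statistics.pvariance(xs)       # ML estimator of the variance: divisor n
s2_unbiased = statistics.variance(xs)       # unbiassed form: divisor n - 1

# The two variance estimates differ by the factor (n - 1)/n, which is the
# ratio of E(sigma2_hat) to sigma^2 quoted in the text.
print(mu_hat, sigma2_hat, s2_unbiased * (n - 1) / n)
```

The residuals about the estimated mean sum to zero, which is the first likelihood equation evaluated at the solution.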


18.25 In the case where the terminals of the range of a distribution depend on 
more than one parameter, there has not, so far as we know, been any general investiga- 
tion of the uniqueness of the ML estimator in the presence of sufficient statistics, 
corresponding to that for the one-parameter case in 18.6. But if the statistics are 
individually, as well as jointly, sufficient for the parameters on which the terminals 
of the range depend, the result of 18.6 obviously holds good, as in the following example. 


Example 18.12
In Example 17.21, we saw that for the distribution

$$dF(x) = \frac{dx}{\beta - \alpha}, \qquad \alpha \leq x \leq \beta,$$

the extreme observations $x_{(1)}$ and $x_{(n)}$ are a pair of jointly sufficient statistics for $\alpha$ and $\beta$.
In this case, it is clear that the ML estimators

$$\hat\alpha = x_{(1)}, \qquad \hat\beta = x_{(n)},$$

maximize the LF uniquely, and the same will be true whenever each terminal of the
range of a distribution depends on a different parameter.


Consistency and efficiency in the general multiparameter case

18.26 In the general case, where there is not necessarily a set of k sufficient statistics
for the k parameters, the joint ML estimators have similar optimum properties, in
large samples, to those in the single-parameter case.

In the first place, we note that the proof of consistency given in 18.10 holds good
for the multiparameter case if we there interpret $\theta$ as a vector of parameters $\theta_1, \dots, \theta_k$
and $\hat\theta$, $\theta^*$ as vectors of estimators of $\theta$. We therefore have the result that under very
general conditions the joint ML estimators converge in probability, as a set, to the true
set of parameter values $\theta_0$.

Further, by an immediate generalization of the method of 18.16, we may show
(see, e.g., Wald (1943a)) that the joint ML estimators tend, under regularity conditions,
to a multivariate normal distribution, with dispersion matrix whose inverse is given by

$$(V^{-1}_{rs}) = -E\left(\frac{\partial^2 \log L}{\partial\theta_r\,\partial\theta_s}\right) = E\left(\frac{\partial \log L}{\partial\theta_r}\,\frac{\partial \log L}{\partial\theta_s}\right). \tag{18.60}$$

We shall only sketch the essentials of the proof. The analogue of the Taylor
expansion of (18.31) becomes, on putting the left-hand side equal to zero,

$$\left(\frac{\partial \log L}{\partial\theta_r}\right)_{\theta_0} = \sum_{s=1}^{k} (\hat\theta_s - \theta_{s0})\left(-\frac{\partial^2 \log L}{\partial\theta_r\,\partial\theta_s}\right)_{\theta^*}, \qquad r = 1, 2, \dots, k. \tag{18.61}$$

Since $\theta^*$ is a value converging in probability to $\theta_0$, and the second derivatives on the
right-hand side of (18.61) converge in probability to their expectations, we may regard
(18.61) as a set of linear equations in the quantities $(\hat\theta_r - \theta_{r0})$, which we may rewrite

$$y = V^{-1}z, \tag{18.62}$$

where $y = \left(\dfrac{\partial \log L}{\partial\theta}\right)_{\theta_0}$, $z = \hat\theta - \theta_0$ and $V^{-1}$ is defined at (18.60).

By the multivariate Central Limit theorem, the vector y will tend to be multinormally
distributed, with zero mean if (18.29) holds good for each $\theta_r$, and hence
so will the vector z be. The dispersion matrix of y is $V^{-1}$ of (18.60), by definition,
so that the exponent of its multinormal distribution will be the quadratic form (cf. 15.3)

$$-\tfrac{1}{2}\,y'Vy. \tag{18.63}$$

The transformation (18.62) gives the quadratic form for z

$$-\tfrac{1}{2}\,z'V^{-1}z,$$

so that the dispersion matrix of z is $(V^{-1})^{-1} = V$, as stated at (18.60).


18.27 If there is a set of k jointly sufficient statistics for the k parameters, we may
use (18.55) in (18.60) to obtain, for the inverse of the dispersion matrix of the ML
estimators in large samples,

$$(V^{-1}_{rs}) = -\left(\frac{\partial^2 \log L}{\partial\theta_r\,\partial\theta_s}\right)_{\hat\theta=\theta}. \tag{18.64}$$

(18.64), which is the generalization of the result of 18.18, removes the necessity for
finding mean values.
If there is no set of k sufficient statistics, the elements of the dispersion matrix may
be estimated from the sample by standard methods.


18.28 The fact that the ML estimators have the dispersion matrix defined at
(18.60) enables us to establish a further optimum property of joint ML estimators.

Consider any set $t_1, \dots, t_k$ of consistent estimators (supposed not to be functionally
related) of the parameters $\theta_1, \dots, \theta_k$, with dispersion matrix D. As at (17.21),
we have asymptotically, under regularity conditions, if each consistent estimator is
asymptotically unbiassed,

$$\int \cdots \int t_r L\,dx_1 \cdots dx_n = \theta_r, \qquad r = 1, 2, \dots, k,$$

so that, on differentiating,

$$\int \cdots \int t_r \frac{\partial L}{\partial\theta_s}\,dx_1 \cdots dx_n = \begin{cases} 1, & r = s, \\ 0, & r \neq s, \end{cases}$$

which we may write, in view of (17.18),

$$\operatorname{cov}\left(t_r, \frac{\partial \log L}{\partial\theta_s}\right) = \begin{cases} 1, & r = s, \\ 0, & r \neq s. \end{cases} \tag{18.65}$$

Let us now consider the dispersion matrix of the 2k variates $t_1, \dots, t_k$,
$\dfrac{\partial \log L}{\partial\theta_1}, \dots, \dfrac{\partial \log L}{\partial\theta_k}$. This is, using (18.65) and the results of 18.26,

$$C = \begin{pmatrix} D & I_k \\ I_k & V^{-1} \end{pmatrix}, \tag{18.66}$$

where $I_k$ is the identity matrix of order k. C, being a dispersion matrix, is non-negative
definite (cf. (18.59)). Hence its determinant is non-negative. The determinant
of the matrix

$$M = \begin{pmatrix} I_k & -V \\ 0 & V \end{pmatrix}$$

is also non-negative. Thus

$$|MC| = |M|\,|C| \geq 0.$$

Since

$$MC = \begin{pmatrix} D - V & 0 \\ V & I_k \end{pmatrix},$$

it follows at once that $|D - V| \geq 0$, which implies (cf. 19.8 below)

$$|D| \geq |V|. \tag{18.67}$$

Thus the determinant of the dispersion matrix of any set of estimators, which is called
their generalized variance, cannot be less than |V| in value asymptotically. But we
have already seen in 18.26 that the ML estimators have |D| = |V| asymptotically.
Thus the ML estimators minimize the generalized variance in large samples, a result
due originally to Geary (1942a).


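Example 18.13
Consider again the ML estimators $\bar x$ and $s^2$ in Example 18.11. We have

$$\frac{\partial^2 \log L}{\partial\mu^2} = -\frac{n}{\sigma^2}, \qquad
\frac{\partial^2 \log L}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{\sum (x - \mu)^2}{\sigma^6}, \qquad
\frac{\partial^2 \log L}{\partial\mu\,\partial(\sigma^2)} = -\frac{n(\bar x - \mu)}{\sigma^4}.$$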



Remembering that the ML estimators $\bar x$ and $s^2$ are sufficient for $\mu$ and $\sigma^2$, we use (18.64)
and obtain the inverse of their dispersion matrix in large samples by putting $\bar x = \mu$
and $\sum (x - \mu)^2 = n\sigma^2$ in these second derivatives. We find

$$V^{-1} = \begin{pmatrix} \dfrac{n}{\sigma^2} & 0 \\ 0 & \dfrac{n}{2\sigma^4} \end{pmatrix}, \qquad \text{so that} \qquad V = \begin{pmatrix} \sigma^2/n & 0 \\ 0 & 2\sigma^4/n \end{pmatrix}.$$

We see from this that $\bar x$ and $s^2$ are asymptotically normally and independently distributed
with the variances given. However, we know that the independence property and
the normality and variance of $\bar x$ are exact for any n (Examples 11.3, 11.12); but the
normality property and the variance of $s^2$ are strictly limiting ones, for we have seen
(Example 11.7) that $ns^2/\sigma^2$ is distributed exactly like $\chi^2$ with (n−1) degrees of freedom,
the variance of $s^2$ therefore, from (16.5), being exactly $\left(\dfrac{\sigma^2}{n}\right)^2 \cdot 2(n-1) = 2\sigma^4(n-1)/n^2$.


18.29 Where a distribution depends on k parameters, we may be interested in
estimating any number of them from 1 to k, the others being known. Under regularity
conditions, the ML estimators of the parameters concerned will be obtained by
selecting the appropriate subset of the k likelihood equations (18.47) and solving them.
By the nature of this process, it is not to be expected that the ML estimator of a particular
parameter will be unaffected by knowledge of the other parameters of the distribution.
The form of the ML estimator depends on the company it keeps, as is made
clear by the following example.


Example 18.14 


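For the bivariate normal distribution
$$dF = \frac{dx\,dy}{2\pi\sigma_1\sigma_2(1-\rho^2)^{1/2}}
\exp\left[-\frac{1}{2(1-\rho^2)}\left\{\left(\frac{x-\mu_1}{\sigma_1}\right)^2
- 2\rho\left(\frac{x-\mu_1}{\sigma_1}\right)\left(\frac{y-\mu_2}{\sigma_2}\right)
+ \left(\frac{y-\mu_2}{\sigma_2}\right)^2\right\}\right],$$
$$-\infty < x, y < \infty; \quad \sigma_1, \sigma_2 > 0; \quad |\rho| < 1,$$
we obtain the logarithm of the LF
$$\log L(x, y \mid \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho) = -n\log(2\pi) - \tfrac{1}{2}n\{\log\sigma_1^2 + \log\sigma_2^2 + \log(1-\rho^2)\}$$
$$\qquad - \frac{1}{2(1-\rho^2)}\sum\left\{\left(\frac{x-\mu_1}{\sigma_1}\right)^2
- 2\rho\left(\frac{x-\mu_1}{\sigma_1}\right)\left(\frac{y-\mu_2}{\sigma_2}\right)
+ \left(\frac{y-\mu_2}{\sigma_2}\right)^2\right\},$$
from which the five likelihood equations are
$$\frac{\partial \log L}{\partial \mu_1} = \frac{n}{\sigma_1(1-\rho^2)}\left\{\frac{\bar{x}-\mu_1}{\sigma_1} - \rho\,\frac{\bar{y}-\mu_2}{\sigma_2}\right\} = 0,$$
$$\frac{\partial \log L}{\partial \mu_2} = \frac{n}{\sigma_2(1-\rho^2)}\left\{\frac{\bar{y}-\mu_2}{\sigma_2} - \rho\,\frac{\bar{x}-\mu_1}{\sigma_1}\right\} = 0, \tag{18.68}$$
$$\frac{\partial \log L}{\partial (\sigma_1^2)} = -\frac{1}{2\sigma_1^2(1-\rho^2)}\left\{n(1-\rho^2) - \frac{\sum(x-\mu_1)^2}{\sigma_1^2} + \rho\,\frac{\sum(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2}\right\} = 0,$$
$$\frac{\partial \log L}{\partial (\sigma_2^2)} = -\frac{1}{2\sigma_2^2(1-\rho^2)}\left\{n(1-\rho^2) - \frac{\sum(y-\mu_2)^2}{\sigma_2^2} + \rho\,\frac{\sum(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2}\right\} = 0, \tag{18.69}$$
$$\frac{\partial \log L}{\partial \rho} = \frac{1}{(1-\rho^2)}\left[n\rho - \frac{1}{(1-\rho^2)}\left\{\rho\left(\frac{\sum(x-\mu_1)^2}{\sigma_1^2} + \frac{\sum(y-\mu_2)^2}{\sigma_2^2}\right) - (1+\rho^2)\frac{\sum(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2}\right\}\right] = 0. \tag{18.70}$$

(a) Suppose first that we wish to estimate $\rho$ alone, the other four parameters being
known. We then solve (18.70) alone. We have already dealt with this case, in
standardized form, in Example 18.3. (18.70) yields a cubic equation for the ML
estimator $\hat{\rho}$.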

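(b) Suppose now that we wish to estimate $\sigma_1^2$, $\sigma_2^2$ and $\rho$, $\mu_1$ and $\mu_2$ being known.
We have to solve the three likelihood equations (18.69) and (18.70). Dropping the
non-zero factors outside the braces, these equations become, after a slight rearrangement,
$$n(1-\rho^2) = \frac{\sum(x-\mu_1)^2}{\sigma_1^2} - \rho\,\frac{\sum(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2},$$
$$n(1-\rho^2) = \frac{\sum(y-\mu_2)^2}{\sigma_2^2} - \rho\,\frac{\sum(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2}, \tag{18.71}$$
and
$$n(1-\rho^2) = \frac{\sum(x-\mu_1)^2}{\sigma_1^2} + \frac{\sum(y-\mu_2)^2}{\sigma_2^2} - \frac{(1+\rho^2)}{\rho}\,\frac{\sum(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2}. \tag{18.72}$$
If we add the equations in (18.71), and subtract (18.72) from this sum, we have
$$n(1-\rho^2) = \frac{1-\rho^2}{\rho}\,\frac{\sum(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2}$$
or
$$\rho = \frac{\frac{1}{n}\sum(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2}. \tag{18.73}$$
Substituting (18.73) into (18.71) we obtain
$$\hat{\sigma}_1^2 = \frac{1}{n}\sum(x-\mu_1)^2, \qquad \hat{\sigma}_2^2 = \frac{1}{n}\sum(y-\mu_2)^2, \tag{18.74}$$
and hence, from (18.73),
$$\hat{\rho} = \frac{\frac{1}{n}\sum(x-\mu_1)(y-\mu_2)}{\hat{\sigma}_1\hat{\sigma}_2}. \tag{18.75}$$
In this case, therefore, the ML estimator $\hat{\rho}$ is the sample correlation coefficient
calculated about the known population means.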
(c) Finally, suppose that we wish to estimate all five parameters of the distribution.
We solve (18.68), (18.69) and (18.70) together. (18.68) reduces to
$$\frac{\bar{x}-\mu_1}{\sigma_1} = \rho\,\frac{\bar{y}-\mu_2}{\sigma_2}, \qquad
\frac{\bar{y}-\mu_2}{\sigma_2} = \rho\,\frac{\bar{x}-\mu_1}{\sigma_1}, \tag{18.76}$$
a pair of equations whose only solution is
$$\bar{x} - \mu_1 = \bar{y} - \mu_2 = 0. \tag{18.77}$$
Taken with (18.74) and (18.75), which are the solutions of (18.69) and (18.70), (18.77)
gives for the set of five ML estimators
$$\hat{\mu}_1 = \bar{x}, \qquad \hat{\sigma}_1^2 = \frac{1}{n}\sum(x-\bar{x})^2 = s_1^2,$$
$$\hat{\mu}_2 = \bar{y}, \qquad \hat{\sigma}_2^2 = \frac{1}{n}\sum(y-\bar{y})^2 = s_2^2, \qquad
\hat{\rho} = \frac{\frac{1}{n}\sum(x-\bar{x})(y-\bar{y})}{s_1 s_2}. \tag{18.78}$$

Thus the ML estimators of all five parameters are the corresponding sample moments.
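As an illustration of cases (b) and (c), the estimators (18.74), (18.75) and (18.78) can be computed directly from paired observations. This is a minimal sketch; the function names and the toy data are hypothetical and not part of the text.

```python
import math

def mle_all_five(xs, ys):
    """Case (c): the five ML estimators (18.78), i.e. sample moments with divisor n."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    s1_sq = sum((x - xbar) ** 2 for x in xs) / n
    s2_sq = sum((y - ybar) ** 2 for y in ys) / n
    s12 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / n
    rho = s12 / math.sqrt(s1_sq * s2_sq)
    return xbar, ybar, s1_sq, s2_sq, rho

def mle_known_means(xs, ys, mu1, mu2):
    """Case (b): (18.74) and (18.75), moments taken about the known means."""
    n = len(xs)
    v1 = sum((x - mu1) ** 2 for x in xs) / n
    v2 = sum((y - mu2) ** 2 for y in ys) / n
    c12 = sum((x - mu1) * (y - mu2) for x, y in zip(xs, ys)) / n
    return v1, v2, c12 / math.sqrt(v1 * v2)

if __name__ == "__main__":
    xs, ys = [0.0, 1.0, 2.0], [1.0, 0.0, 2.0]
    print(mle_all_five(xs, ys))
    print(mle_known_means(xs, ys, 1.0, 1.0))
```

The two functions give the same correlation estimate only when the known means happen to coincide with the sample means, which is the sense in which the estimator "depends on the company it keeps".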


18.30 Since the ML estimator of a parameter is a different function of the observations,
according to which of the other parameters of the distribution is known, its
large-sample variance will also vary. To facilitate the evaluation of the dispersion matrices
of ML estimators, we recall that if a distribution admits a set of k sufficient statistics for
its k parameters, we may avail ourselves of the form (18.64) for the inverse of the
dispersion matrix of the ML estimators, in large samples.


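Example 18.15

We now proceed to evaluate the large-sample dispersion matrices of the ML
estimators in each of the three cases considered in Example 18.14.

(a) When we are estimating $\rho$ alone, $\hat{\rho}$ is not sufficient. But we have already
evaluated its large-sample variance in Example 18.6, finding
$$\operatorname{var}\hat{\rho} = \frac{(1-\rho^2)^2}{n(1+\rho^2)}. \tag{18.79}$$

The fact that we were there dealing with a standardized parent distribution is
irrelevant, since $\hat{\rho}$ is invariant under changes of origin and scale.

(b) In estimating the three parameters $\sigma_1^2$, $\sigma_2^2$ and $\rho$, the three ML estimators given
by (18.74) and (18.75) are jointly sufficient, and we therefore make use of (18.64).
Writing the parameters in the above order, we find for the 3 x 3 inverse dispersion
matrix
$$\mathbf{V}_3^{-1} = -\left(\frac{\partial^2 \log L}{\partial\theta_r\,\partial\theta_s}\right)_{\hat{\theta}=\theta}
= \frac{n}{(1-\rho^2)}\begin{pmatrix}
\dfrac{2-\rho^2}{4\sigma_1^4} & -\dfrac{\rho^2}{4\sigma_1^2\sigma_2^2} & -\dfrac{\rho}{2\sigma_1^2} \\
-\dfrac{\rho^2}{4\sigma_1^2\sigma_2^2} & \dfrac{2-\rho^2}{4\sigma_2^4} & -\dfrac{\rho}{2\sigma_2^2} \\
-\dfrac{\rho}{2\sigma_1^2} & -\dfrac{\rho}{2\sigma_2^2} & \dfrac{1+\rho^2}{1-\rho^2}
\end{pmatrix}. \tag{18.80}$$

Inversion of (18.80) gives for the large-sample dispersion matrix
$$\mathbf{V}_3 = \frac{1}{n}\begin{pmatrix}
2\sigma_1^4 & 2\rho^2\sigma_1^2\sigma_2^2 & \rho(1-\rho^2)\sigma_1^2 \\
2\rho^2\sigma_1^2\sigma_2^2 & 2\sigma_2^4 & \rho(1-\rho^2)\sigma_2^2 \\
\rho(1-\rho^2)\sigma_1^2 & \rho(1-\rho^2)\sigma_2^2 & (1-\rho^2)^2
\end{pmatrix}. \tag{18.81}$$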
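The inversion leading from (18.80) to (18.81) can be checked in exact rational arithmetic. The sketch below builds both matrices for arbitrarily chosen illustrative values of $n$, $\sigma_1^2$, $\sigma_2^2$ and $\rho$ and multiplies them; the helper names are hypothetical.

```python
from fractions import Fraction as F

def v3_inverse(n, s1, s2, rho):
    """Matrix (18.80); s1 and s2 stand for sigma_1^2 and sigma_2^2."""
    c = F(n) / (1 - rho ** 2)
    return [
        [c * (2 - rho ** 2) / (4 * s1 ** 2), -c * rho ** 2 / (4 * s1 * s2), -c * rho / (2 * s1)],
        [-c * rho ** 2 / (4 * s1 * s2), c * (2 - rho ** 2) / (4 * s2 ** 2), -c * rho / (2 * s2)],
        [-c * rho / (2 * s1), -c * rho / (2 * s2), c * (1 + rho ** 2) / (1 - rho ** 2)],
    ]

def v3(n, s1, s2, rho):
    """Matrix (18.81), the claimed inverse of (18.80)."""
    k = F(1, n)
    q = rho * (1 - rho ** 2)
    return [
        [k * 2 * s1 ** 2, k * 2 * rho ** 2 * s1 * s2, k * q * s1],
        [k * 2 * rho ** 2 * s1 * s2, k * 2 * s2 ** 2, k * q * s2],
        [k * q * s1, k * q * s2, k * (1 - rho ** 2) ** 2],
    ]

def matmul(a, b):
    """Plain 3 x 3 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

if __name__ == "__main__":
    args = (7, F(3), F(5, 2), F(1, 3))  # illustrative values only
    print(matmul(v3_inverse(*args), v3(*args)))
```

Because `Fraction` arithmetic is exact, the product is the identity matrix with no rounding error, which confirms the algebra for the chosen values.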
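(c) In estimating all five parameters, the ML estimators (18.78) are a sufficient set.
Moreover, the 3 x 3 matrix $\mathbf{V}_3^{-1}$ at (18.80) will form part of the 5 x 5 inverse variance
matrix $\mathbf{V}_5^{-1}$ which we now seek. Writing the parameters in the order $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho$,
(18.80) will be the lower 3 x 3 principal minor of $\mathbf{V}_5^{-1}$. For the elements involving
derivatives with respect to $\mu_1$ and $\mu_2$ we find from (18.68)
$$\frac{\partial^2 \log L}{\partial\mu_1^2} = -\frac{n}{\sigma_1^2(1-\rho^2)}, \qquad
\frac{\partial^2 \log L}{\partial\mu_2^2} = -\frac{n}{\sigma_2^2(1-\rho^2)}, \qquad
\frac{\partial^2 \log L}{\partial\mu_1\,\partial\mu_2} = \frac{n\rho}{\sigma_1\sigma_2(1-\rho^2)}, \tag{18.82}$$
while
$$\frac{\partial^2 \log L}{\partial\mu_i\,\partial(\sigma_1^2)} = \frac{\partial^2 \log L}{\partial\mu_i\,\partial(\sigma_2^2)} = \frac{\partial^2 \log L}{\partial\mu_i\,\partial\rho} = 0, \qquad i = 1, 2, \tag{18.83}$$
at $\bar{x} = \mu_1$, $\bar{y} = \mu_2$. Thus if we write, for the inverse of the dispersion matrix of the
ML estimators of $\mu_1$ and $\mu_2$,
$$\mathbf{V}_2^{-1} = \frac{n}{(1-\rho^2)}\begin{pmatrix}
\dfrac{1}{\sigma_1^2} & -\dfrac{\rho}{\sigma_1\sigma_2} \\
-\dfrac{\rho}{\sigma_1\sigma_2} & \dfrac{1}{\sigma_2^2}
\end{pmatrix}, \tag{18.84}$$
we have
$$\mathbf{V}_5^{-1} = \begin{pmatrix} \mathbf{V}_2^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{V}_3^{-1} \end{pmatrix} \tag{18.85}$$
and we may invert $\mathbf{V}_2^{-1}$ and $\mathbf{V}_3^{-1}$ separately to obtain the non-zero elements of the
inverse of $\mathbf{V}_5^{-1}$. We have already inverted $\mathbf{V}_3^{-1}$ at (18.81). The inverse of $\mathbf{V}_2^{-1}$ is,
from (18.84),
$$\mathbf{V}_2 = \frac{1}{n}\begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}, \tag{18.86}$$
so
$$\mathbf{V}_5 = \begin{pmatrix} \mathbf{V}_2 & \mathbf{0} \\ \mathbf{0} & \mathbf{V}_3 \end{pmatrix} \tag{18.87}$$
with $\mathbf{V}_2$ and $\mathbf{V}_3$ defined at (18.86) and (18.81).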

We see from this result, what we have already observed (cf. 16.25) to be true for
any sample size, that the sample means are distributed independently of the variances
and covariance in bivariate normal samples, and that the correlation between the sample
means is $\rho$; and that the correlation between the sample variances is $\rho^2$ (Example 13.1).

Kale (1962) considers alternative iterative methods for solving the likelihood equations
for several parameters. (18.62) is used with $\boldsymbol{\theta}_0$ replaced by a trial vector $\mathbf{t}$ (cf. 18.21)
so that $\hat{\boldsymbol{\theta}} = \mathbf{t} + \mathbf{V}\mathbf{y}$ may be iterated as often as necessary.


Non-identical parent distributions 

18.31 We have now largely completed our general survey of the ML method. 
Throughout our discussions so far, we have been considering problems of estimation 
when all the observations come from the same underlying distribution. We now 
briefly examine the behaviour of ML estimators when this condition no longer holds. 
In fact, we replace (18.1) by the more general LF
$$L(x \mid \theta_1, \ldots, \theta_k) = f_1(x_1 \mid \boldsymbol{\theta})\, f_2(x_2 \mid \boldsymbol{\theta}) \cdots f_n(x_n \mid \boldsymbol{\theta}), \tag{18.88}$$
where the different factors $f_i$ on the right of (18.88) depend on possibly different
functions of the set of parameters $\theta_1, \ldots, \theta_k$.

It is not now necessarily true even that ML estimators are consistent, and in one
particular class of cases, that in which the number of parameters increases with the
number of observations ($k$ being a function of $n$), the ML method may become
ineffective. The following example illustrates these points.


Example 18.16 


Suppose that $x_i$ is a normal variate with mean $\theta_i$ and variance $\sigma^2 > 0$ ($i = 1, 2, \ldots, n$).
The LF (18.88) gives
$$\log L = -\tfrac{1}{2}n\log(2\pi) - \tfrac{1}{2}n\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_i (x_i-\theta_i)^2$$
and
$$\frac{\partial \log L}{\partial(\sigma^2)} = -\frac{n}{2\sigma^2} + \frac{\sum_i (x_i-\theta_i)^2}{2\sigma^4} = 0$$
yields the ML estimator
$$\hat{\sigma}^2 = \frac{1}{n}\sum_i (x_i-\theta_i)^2. \tag{18.89}$$

But, since we only have one observation from each distinct normal distribution, we
also have
$$\hat{\theta}_i = x_i, \tag{18.90}$$
so that if we estimate $\sigma^2$ and $\theta_i$ ($i = 1, 2, \ldots, n$) jointly, (18.89) and (18.90) give
$$\hat{\sigma}^2 = 0,$$
an absurd result. We cannot expect effective estimation of $\sigma^2$ with one observation
from each distribution: (18.90) is a completely useless estimator.

However, the situation is not much improved if we have two observations from
each of the normal distributions. We then have
$$\hat{\theta}_i = \tfrac{1}{2}(x_{i1} + x_{i2}) = \bar{x}_i, \qquad
\hat{\sigma}^2 = \frac{1}{2n}\sum_{i=1}^{n}\sum_{j=1}^{2}(x_{ij}-\bar{x}_i)^2. \tag{18.91}$$

But since
$$E\left\{\tfrac{1}{2}\sum_{j=1}^{2}(x_{ij}-\bar{x}_i)^2\right\} = \tfrac{1}{2}\sigma^2$$
(cf. Example 17.3) we have from (18.91), for all $n$,
$$E(\hat{\sigma}^2) = \tfrac{1}{2}\sigma^2,$$
so that $\hat{\sigma}^2$ is not consistent, as Neyman and Scott (1948) pointed out. What has
happened is that the small-sample bias of ML estimators (18.14) persists in this example
as $n$ increases, for the number of distributions also increases with $n$.
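The inconsistency in Example 18.16 is easy to see by simulation. The sketch below draws two observations from each of many normal distributions with a common variance and evaluates the estimator (18.91); the range used for the nuisance means, the seed and the function name are arbitrary assumptions made for illustration.

```python
import random

def sigma2_hat_two_per_group(n_groups, sigma, rng):
    """ML estimator (18.91): pooled squared deviations about each pair mean, divisor 2n."""
    total = 0.0
    for _ in range(n_groups):
        theta = rng.uniform(-10.0, 10.0)   # arbitrary nuisance mean for this pair
        x1 = rng.gauss(theta, sigma)
        x2 = rng.gauss(theta, sigma)
        xbar = 0.5 * (x1 + x2)
        total += (x1 - xbar) ** 2 + (x2 - xbar) ** 2
    return total / (2 * n_groups)

if __name__ == "__main__":
    rng = random.Random(0)
    sigma = 2.0                            # true variance is 4
    for n_groups in (100, 1000, 20000):
        print(n_groups, sigma2_hat_two_per_group(n_groups, sigma, rng))
```

With a true variance of 4 the estimates settle near 2, that is near $\tfrac{1}{2}\sigma^2$, however many pairs are taken.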


Other examples of the type of Example 18.16 have been discussed in the literature 
in connexion with applications to which they are relevant. We shall discuss these 
as they arise in later chapters, particularly Chapter 29. Here we need only emphasize
that careful investigation of the properties of ML estimators is necessary in
non-standard situations: it cannot be assumed that the large-sample optimum properties
will persist. For example, the ML estimator may not even exist (cf. Exercise 18.34) 
or there may be a multiplicity of ML estimators, some of which are inconsistent (cf. 
Exercise 18.35). 


The use of the Likelihood Function 


18.32 It is always possible, as Fisher (1956) recommends, to examine the course
of the LF throughout the permissible range of variation of $\theta$, and to draw a graph of
the LF. While this may be generally informative, it does not seem to be of any
immediate value in the estimation problem. The LF contains all the information in the sample
precisely in the sense that, as we remarked in 17.38, the observations themselves
constitute a set of jointly sufficient statistics for the parameters of any problem. This way
of putting it has the merit of drawing attention to the fact that the functional form of
the distribution(s) generating the observations must be decided before the LF can
be used at all, whether for ML estimation or otherwise. In other words, some
information (in a general sense) must be supplied by the statistician: if he is unable or
unwilling to supply it, resort must be had to the non-parametric hypotheses to be
discussed in later chapters, and the quite different methods to which they lead.

In 23.37 below, we discuss some recent developments linking the use of the LF 
with conditional inferential procedures. 


The estimation of location and scale parameters 

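18.33 We may use ML methods to solve, following Fisher (1921a), the problem
of finding efficient estimators of location and scale parameters for any given form of
distribution.

Consider a frequency function
$$dF(x) = f\left(\frac{x-\alpha}{\beta}\right) d\left(\frac{x}{\beta}\right), \qquad \beta > 0. \tag{18.92}$$
The parameter $\alpha$ locates the distribution and $\beta$ is a scale parameter. We rewrite
(18.92) as
$$dF = \exp\{g(y)\}\,dy = \exp\{g(y)\}\,dx/\beta, \tag{18.93}$$
where $y = (x-\alpha)/\beta$, $g(y) = \log f(y)$.
In samples of size $n$, the LF is
$$\log L(x \mid \alpha, \beta) = \sum_i g(y_i) - n\log\beta. \tag{18.94}$$
(18.94) yields the likelihood equations
$$\frac{\partial \log L}{\partial\alpha} = -\frac{1}{\beta}\sum_i g'(y_i) = 0, \qquad
\frac{\partial \log L}{\partial\beta} = -\frac{1}{\beta}\left\{\sum_i y_i g'(y_i) + n\right\} = 0, \tag{18.95}$$
where $g'(y) = dg(y)/dy$. Under regularity conditions, solution of (18.95) gives the
ML estimators $\hat{\alpha}$ and $\hat{\beta}$. We now assume that for all permissible values of $\alpha$ and $\beta$
$$E\left(\frac{\partial \log L}{\partial\alpha}\right) = -\frac{n}{\beta}E\{g'(y)\} = 0, \qquad
E\left(\frac{\partial \log L}{\partial\beta}\right) = -\frac{n}{\beta}\left[E\{y g'(y)\} + 1\right] = 0. \tag{18.96}$$

As at (17.18), (18.96) will hold if we may differentiate under the integral signs on
the left. We may rewrite (18.96) as
$$E\{g'(y)\} = 0, \tag{18.97}$$
$$E\{y g'(y)\} = -1. \tag{18.98}$$
We now evaluate the elements of the inverse dispersion matrix (18.60). With elements
$$E\left(\frac{\partial \log L}{\partial\theta_r}\,\frac{\partial \log L}{\partial\theta_s}\right) = -E\left(\frac{\partial^2 \log L}{\partial\theta_r\,\partial\theta_s}\right),$$
these are as follows. From (18.95), dropping the argument of $g(y)$, we have
$$-E\left(\frac{\partial^2 \log L}{\partial\alpha^2}\right) = -\frac{n}{\beta^2}E(g''), \tag{18.99}$$
$$-E\left(\frac{\partial^2 \log L}{\partial\beta^2}\right) = -\frac{n}{\beta^2}E(g''y^2 + 2g'y + 1),$$
which on using (18.98) becomes
$$-E\left(\frac{\partial^2 \log L}{\partial\beta^2}\right) = -\frac{n}{\beta^2}\{E(g''y^2) - 1\}. \tag{18.100}$$
Also
$$-E\left(\frac{\partial^2 \log L}{\partial\alpha\,\partial\beta}\right) = -\frac{n}{\beta^2}E(g''y + g'),$$
which on using (18.97) becomes
$$-E\left(\frac{\partial^2 \log L}{\partial\alpha\,\partial\beta}\right) = -\frac{n}{\beta^2}E(g''y). \tag{18.101}$$
(18.99-18.101) give, for the matrix (18.60),
$$\mathbf{V}^{-1} = -\frac{n}{\beta^2}\begin{pmatrix} E(g'') & E(g''y) \\ E(g''y) & E(g''y^2) - 1 \end{pmatrix}, \tag{18.102}$$
from which the variances and covariance may be determined by inversion. Of course,
if $\alpha$ or $\beta$ alone is being estimated, the variance of the ML estimator will be the reciprocal
of the appropriate term in the leading diagonal of $\mathbf{V}^{-1}$.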


18.34 If $g(y)$ is an even function of its argument, i.e. the distribution is symmetric
about $\alpha$, (18.102) simplifies. For then
$$g(y) = g(-y), \qquad g'(y) = -g'(-y), \qquad g''(y) = g''(-y). \tag{18.103}$$

Using (18.103), we see that the off-diagonal term in (18.102)
$$E\{g''(y)\,y\} = 0, \tag{18.104}$$
so that (18.102) is a diagonal matrix for symmetric distributions. Hence the ML
estimators of the location and scale parameters of any symmetric distribution obeying
our regularity conditions will be asymptotically uncorrelated, and (since they are
asymptotically bivariate normally distributed by 18.26) asymptotically independent. In
particular, this applies to the normal distribution, for which we have already derived
this result directly in Example 18.13.


18.35 Even for asymmetrical distributions, we can make the off-diagonal term in
(18.102) zero by a simple change of origin. Put
$$z = y - \frac{E(g''y)}{E(g'')}. \tag{18.105}$$
Then
$$E(g''y) = E\left[g''\left\{z + \frac{E(g''y)}{E(g'')}\right\}\right] = E(g''z) + E(g''y),$$
so that
$$E(g''z) = 0. \tag{18.106}$$

Thus if we measure from an origin as in (18.105), we reduce (18.102) to a diagonal
matrix, and we obtain the variances of the estimators easily by taking the reciprocals
of the terms in the diagonal. The origin which makes the estimators uncorrelated
is called the centre of location of the distribution. The object of choosing the centre
of location as origin is that, where iterative procedures are necessary, the estimators
may be separately treated.


Example 18.17 
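The distribution
$$dF(x) = \frac{1}{\Gamma(p)}\left(\frac{x-\alpha}{\beta}\right)^{p-1}\exp\left\{-\left(\frac{x-\alpha}{\beta}\right)\right\} d\left(\frac{x}{\beta}\right), \qquad \alpha \le x < \infty;\ \beta > 0;\ p > 2,$$
has its range dependent upon $\alpha$, but is zero and has a zero first derivative with respect
to $\alpha$ at its lower terminal for $p > 2$ (cf. Exercise 17.23), so our regularity conditions
hold. Here
$$g(y) = -\log\Gamma(p) + (p-1)\log y - y,$$
and
$$E(g'') = E\left\{-\frac{(p-1)}{y^2}\right\} = -\frac{1}{p-2},$$
$$E(g''y) = E\left\{-\frac{(p-1)}{y}\right\} = -1,$$
$$E(g''y^2) = E\{-(p-1)\} = -(p-1).$$
Thus the centre of location is, from (18.105),
$$\frac{E(g''y)}{E(g'')} = p - 2.$$
The inverse dispersion matrix (18.102) is
$$\mathbf{V}^{-1} = \frac{n}{\beta^2}\begin{pmatrix} \dfrac{1}{p-2} & 1 \\ 1 & p \end{pmatrix} \tag{18.107}$$
and its inverse, the dispersion matrix of $\hat{\alpha}$ and $\hat{\beta}$, is easily obtained directly as
$$\mathbf{V} = \frac{\beta^2}{2n}\begin{pmatrix} p(p-2) & -(p-2) \\ -(p-2) & 1 \end{pmatrix}. \tag{18.108}$$

If we measure from the centre of location, we have for the uncorrelated estimators
$$\operatorname{var}\hat{\alpha}_u = (p-2)\beta^2/n, \qquad \operatorname{var}\hat{\beta}_u = \beta^2/(2n). \tag{18.109}$$
Comparing (18.109) with (18.108) and (18.107), we see that $\operatorname{var}\hat{\beta}$ is unaffected by the
change of origin, while $\operatorname{var}\hat{\alpha}_u$ equals $\operatorname{var}\hat{\alpha}$ when $\alpha$ alone is being estimated.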
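The matrices of Example 18.17 can be reproduced from the three expectations $E(g'')$, $E(g''y)$ and $E(g''y^2)$ by substituting them in (18.102) and inverting. The following is a minimal sketch in exact arithmetic for integer $p > 2$; the function names and the default values $n = \beta = 1$ are illustrative assumptions.

```python
from fractions import Fraction as F

def inverse_dispersion(p, n=1, beta=1):
    """Matrix (18.102) for the Gamma form of Example 18.17 (integer p > 2)."""
    e_g2 = F(-1, p - 2)        # E(g'')
    e_g2y = F(-1)              # E(g'' y)
    e_g2y2 = F(-(p - 1))       # E(g'' y^2)
    c = -F(n) / beta ** 2
    return [[c * e_g2, c * e_g2y], [c * e_g2y, c * (e_g2y2 - 1)]]

def invert_2x2(m):
    """Inverse of a 2 x 2 matrix by the adjugate formula."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def centre_of_location(p):
    """Shift E(g''y)/E(g'') appearing in (18.105); equals p - 2 here."""
    return F(-1) / F(-1, p - 2)

if __name__ == "__main__":
    p = 5
    v = invert_2x2(inverse_dispersion(p))
    print(v)                                  # compare with (18.108)
    print(1 / inverse_dispersion(p)[0][0])    # var of alpha-hat_u in (18.109)
    print(centre_of_location(p))
```

For $p = 5$ the inverse is $\tfrac{1}{2}\begin{pmatrix}15 & -3\\ -3 & 1\end{pmatrix}$, in agreement with (18.108), and the reciprocal of the leading element of (18.107) gives $\operatorname{var}\hat{\alpha}_u = 3$.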


Efficiency of the method of moments 

18.36 In Chapter 6 we discussed distributions of the Pearson type. We were
there mainly concerned with the properties of populations only and no question of
the reliability of estimates arose. If, however, the observations are a sample from
a population, the question arises whether fitting by moments provides the most efficient
estimators of the unknown parameters. As we shall see presently, in general it does not.

Consider a parent form dependent on four parameters. If the ML estimators of
these parameters are to be obtained in terms of linear functions of the moments (as
in the fitting of Pearson curves), we must have
$$\frac{\partial \log L}{\partial\theta_r} = a_0 + a_1\sum x + a_2\sum x^2 + a_3\sum x^3 + a_4\sum x^4, \qquad r = 1, \ldots, 4, \tag{18.110}$$
and consequently
$$f(x \mid \theta_1, \ldots, \theta_4) = \exp(b_0 + b_1 x + b_2 x^2 + b_3 x^3 + b_4 x^4), \tag{18.111}$$
where the $b$'s depend on the $\theta$'s. This is the most general form for which the method
of moments gives ML estimators. The $b$'s are, of course, conditioned by the fact
that the total frequency shall be unity and the distribution function converge.

Without loss of generality we may take $b_1 = 0$. If, then, $b_3$ and $b_4$ are zero, the
distribution is normal and the method of moments is efficient. In other cases, (18.111)
does not yield a Pearson distribution except as an approximation. For example,
$$\frac{\partial \log f}{\partial x} = 2b_2 x + 3b_3 x^2 + 4b_4 x^3.$$
If $b_3$ and $b_4$ are small, this is approximately
$$\frac{\partial \log f}{\partial x} = \frac{2b_2 x}{1 - \dfrac{3b_3}{2b_2}x - \dfrac{2b_4}{b_2}x^2}, \tag{18.112}$$
which is one form of the equation defining Pearson distributions (cf. (6.1)). Only
when $b_3$ and $b_4$ are small compared with $b_2$ can we expect the method of moments to
give estimators of high efficiency.


18.37 A detailed discussion of the efficiency of moments in determining the para- 
meters of a Pearson distribution has been given by Fisher (1921a). We will here quote 
only one of the results by way of illustration. 


Example 18.18 


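Consider the Gamma distribution with three parameters, $\alpha$, $\sigma$, $p$,
$$dF = \frac{1}{\sigma\Gamma(p)}\left(\frac{x-\alpha}{\sigma}\right)^{p-1}\exp\left\{-\left(\frac{x-\alpha}{\sigma}\right)\right\}dx, \qquad \alpha \le x < \infty;\ \sigma > 0;\ p > 2.$$

For the LF, we have
$$\log L = -np\log\sigma - n\log\Gamma(p) + (p-1)\sum\log(x-\alpha) - \sum(x-\alpha)/\sigma.$$
The three likelihood equations are
$$\frac{\partial \log L}{\partial\alpha} = -(p-1)\sum\frac{1}{x-\alpha} + \frac{n}{\sigma} = 0,$$
$$\frac{\partial \log L}{\partial\sigma} = -\frac{np}{\sigma} + \frac{\sum(x-\alpha)}{\sigma^2} = 0,$$
$$\frac{\partial \log L}{\partial p} = -n\log\sigma - n\frac{d}{dp}\log\Gamma(p) + \sum\log(x-\alpha) = 0.$$
Taking the parameters in the above order, we have for the inverse dispersion matrix
(18.60)
$$\mathbf{V}^{-1} = n\begin{pmatrix}
\dfrac{1}{\sigma^2(p-2)} & \dfrac{1}{\sigma^2} & \dfrac{1}{\sigma(p-1)} \\
\dfrac{1}{\sigma^2} & \dfrac{p}{\sigma^2} & \dfrac{1}{\sigma} \\
\dfrac{1}{\sigma(p-1)} & \dfrac{1}{\sigma} & \dfrac{d^2\log\Gamma(p)}{dp^2}
\end{pmatrix}$$
with determinant $|\mathbf{V}^{-1}| = n^3\Delta$, where
$$\Delta = \frac{1}{(p-2)\sigma^4}\left\{2\frac{d^2\log\Gamma(p)}{dp^2} - \frac{2}{p-1} + \frac{1}{(p-1)^2}\right\}.$$

From this the sampling variances are found to be
$$\operatorname{var}\hat{\alpha} = \frac{1}{n\Delta\sigma^2}\left\{p\frac{d^2\log\Gamma(p)}{dp^2} - 1\right\},$$
$$\operatorname{var}\hat{\sigma} = \frac{1}{n\Delta\sigma^2}\left\{\frac{1}{p-2}\frac{d^2\log\Gamma(p)}{dp^2} - \frac{1}{(p-1)^2}\right\},$$
$$\operatorname{var}\hat{p} = \frac{2}{n\Delta(p-2)\sigma^4} = \frac{2}{n}\Big/\left\{2\frac{d^2\log\Gamma(p)}{dp^2} - \frac{2}{p-1} + \frac{1}{(p-1)^2}\right\}. \tag{18.113}$$
Now for large $p$, using Stirling's series,
$$\frac{d^2}{dp^2}\log\Gamma(1+p) = \frac{d^2}{dp^2}\left\{\tfrac{1}{2}\log(2\pi) + (p+\tfrac{1}{2})\log p - p + \frac{1}{12p} - \frac{1}{360p^3} + \cdots\right\}.$$
We then find
$$\frac{d^2}{dp^2}\log\Gamma(p) = \frac{1}{p-1} - \frac{1}{2(p-1)^2} + \frac{1}{6(p-1)^3} - \frac{1}{30(p-1)^5} + \cdots$$
and hence approximately, from (18.113),
$$\operatorname{var}\hat{p} = \frac{6}{n}\left\{(p-1)^3 + \tfrac{1}{5}(p-1)\right\}. \tag{18.114}$$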


If we estimate the parameters by equating sample-moments to the appropriate moments
in terms of parameters, we find
$$\alpha + \sigma p = m_1', \qquad \sigma^2 p = m_2, \qquad 2\sigma^3 p = m_3,$$
so that, whatever $\alpha$ and $\sigma$ may be,
$$b_1 = m_3^2/m_2^3 = 4/p, \tag{18.115}$$
where $b_1$ is the sample value of $\beta_1$, the skewness coefficient. Now for estimation by
the method of moments (cf. 10.15),
$$\operatorname{var} b_1 = \frac{\beta_1}{n}\{4\beta_4 - 24\beta_2 + 36 + 9\beta_1\beta_2 - 12\beta_3 + 35\beta_1\},$$
which for the present distribution reduces to
$$\operatorname{var} b_1 = \frac{96(p+1)(p+5)}{n p^3}. \tag{18.116}$$

Hence, from (18.115) we have for $\tilde{p}$, the estimator by the method of moments,
$$\operatorname{var}\tilde{p} = \frac{p^4}{16}\operatorname{var} b_1 = \frac{6}{n}p(p+1)(p+5).$$

For large $p$ the efficiency of this estimator is then, from (18.114),
$$\frac{\operatorname{var}\hat{p}}{\operatorname{var}\tilde{p}} = \frac{(p-1)^3 + \tfrac{1}{5}(p-1)}{p(p+1)(p+5)},$$
which is evidently less than 1. When $p$ exceeds 39.1 ($\beta_1 = 0.102$), the efficiency is
over 80 per cent. For $p = 20$ ($\beta_1 = 0.20$), it is 65 per cent. For $p = 5$, a more
exact calculation based on the tables of the trigamma function $\dfrac{d^2\log\Gamma(1+p)}{dp^2}$ shows
that the efficiency is only 22 per cent.
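The efficiencies quoted at the end of Example 18.18 can be recomputed from (18.113), (18.114) and the moment variance $\frac{6}{n}p(p+1)(p+5)$. The sketch below is illustrative only: the function names are hypothetical, and the trigamma function is evaluated for integer $p$ from the identity $\frac{d^2}{dp^2}\log\Gamma(p) = \pi^2/6 - \sum_{k=1}^{p-1}k^{-2}$ rather than from tables.

```python
import math

def trigamma_int(p):
    """d^2 log Gamma(p) / dp^2 for integer p >= 1."""
    return math.pi ** 2 / 6 - sum(1.0 / k ** 2 for k in range(1, p))

def n_var_ml_exact(p):
    """n * var(p-hat) from (18.113), integer p > 2."""
    return 2.0 / (2.0 * trigamma_int(p) - 2.0 / (p - 1) + 1.0 / (p - 1) ** 2)

def n_var_ml_large_p(p):
    """n * var(p-hat) from the Stirling approximation (18.114)."""
    return 6.0 * ((p - 1) ** 3 + (p - 1) / 5.0)

def n_var_moments(p):
    """n * var of the moment estimator of p."""
    return 6.0 * p * (p + 1) * (p + 5)

def efficiency_large_p(p):
    return n_var_ml_large_p(p) / n_var_moments(p)

def efficiency_exact(p):
    return n_var_ml_exact(p) / n_var_moments(p)

if __name__ == "__main__":
    print(round(efficiency_large_p(39.1), 3))
    print(round(efficiency_large_p(20), 3))
    print(round(efficiency_exact(5), 3))
```

The three printed values agree, after rounding, with the 80, 65 and 22 per cent of the text; at $p = 5$ the large-$p$ formula would give a visibly different figure, which is why the exact trigamma value is used there.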


EXERCISES 


18.1 In Example 18.7 show by considering the case $n = 1$ that the ML estimator
does not attain the MVB for small samples; and deduce that for this distribution the
efficiency of the sample mean compared with the ML estimator is $\tfrac{1}{2}$.


18.2 If the ML estimator $\hat{\theta}$ is a root of $\partial\log L/\partial\theta = 0$, show that the most general
form of distribution differentiable in $\theta$, for which $\hat{\theta} = \bar{x}$, the sample arithmetic mean, is
$$f(x \mid \theta) = \exp\{A(\theta) + A'(\theta)(x-\theta) + B(x)\}$$
and hence that $\bar{x}$ is sufficient for $\theta$, with MVB variance of $\{nA''(\theta)\}^{-1}$. Show that if
$\theta$ is a location parameter, $f$ is a normal distribution with mean $\theta$ (a result going back to
Gauss), while if $\theta$ is a scale parameter, $f = \theta^{-1}\exp(-x/\theta)$.
(Cf. Keynes (1911) and Teicher (1961))


18.3 Show that the most general continuous distribution for which the ML
estimator of a parameter $\theta$ is the geometric mean of the sample is
$$f(x \mid \theta) = \left(\frac{x}{\theta}\right)^{\theta A'(\theta)}\exp\{A(\theta) + B(x)\}.$$

Show further that the corresponding distribution having the harmonic mean as ML
estimator of $\theta$ is
$$f(x \mid \theta) = \exp\left[\frac{1}{x}\{\theta A'(\theta) - A(\theta)\} - A'(\theta) + B(x)\right].$$
(Keynes, 1911)


18.4 In Exercise 18.3, show in each case that the ML estimator is sufficient for $\theta$,
but that it is not a MVB estimator of $\theta$, in contrast to the case of the arithmetic
mean in Exercise 18.2. Find in each case the function of $\theta$ which is estimable with
variance equal to its MVB, and evaluate the MVB.


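18.5 For the distribution
$$dF(x) \propto \exp\{-(x-\alpha)/\beta\}\,dx, \qquad \alpha \le x \le \beta,$$
show that the ML estimators of $\alpha$ and $\beta$ are $x_{(1)}$ and $x_{(n)}$ respectively, but that these are
not a sufficient pair for $\alpha$ and $\beta$.


18.6 Show that for samples of $n$ from the extreme-value distribution (cf. (14.66))
$$dF(x) = \alpha\exp\{-\alpha(x-\mu) - \exp[-\alpha(x-\mu)]\}\,dx, \qquad -\infty < x < \infty,$$
the ML estimators $\hat{\alpha}$ and $\hat{\mu}$ are given by
$$\frac{1}{\hat{\alpha}} = \bar{x} - \frac{\sum x e^{-\hat{\alpha}x}}{\sum e^{-\hat{\alpha}x}}, \qquad
e^{-\hat{\alpha}\hat{\mu}} = \frac{1}{n}\sum e^{-\hat{\alpha}x},$$
and that in large samples
$$\operatorname{var}\hat{\alpha} = \alpha^2/(n\pi^2/6),$$
$$\operatorname{var}\hat{\mu} = \frac{1}{n\alpha^2}\left\{1 + \frac{(1-\gamma)^2}{\pi^2/6}\right\},$$
$$\operatorname{cov}(\hat{\alpha}, \hat{\mu}) = -(1-\gamma)/(n\pi^2/6),$$
where $\gamma$ is Euler's constant 0.5772. ... (B. F. Kimball, 1946)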


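18.7 If $x$ is distributed in the normal form
$$dF = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right\}dx, \qquad -\infty < x < \infty,$$
the lognormal distribution of $y = e^x$ has mean $\theta_1 = \exp(\mu + \tfrac{1}{2}\sigma^2)$ and variance
$\theta_2 = \exp(2\mu + \sigma^2)\{\exp(\sigma^2) - 1\}$. (Cf. (6.63).)
Show that the ML estimator of $\theta_1$ is
$$\hat{\theta}_1 = \exp(\bar{x} + \tfrac{1}{2}s^2),$$
where $\bar{x}$ and $s^2$ are the sample mean and variance of $x$, and that
$$E(\hat{\theta}_1) = E\{\exp(\bar{x})\}\,E\{\exp(\tfrac{1}{2}s^2)\} = \theta_1\exp\left\{-\frac{(n-1)}{n}\frac{\sigma^2}{2}\right\}\left(1 - \frac{\sigma^2}{n}\right)^{-\frac{1}{2}(n-1)} > \theta_1,$$
so that $\hat{\theta}_1$ is biassed upwards. Show that $E(\hat{\theta}_1) \to \theta_1$ as $n \to \infty$, so $\hat{\theta}_1$ is
asymptotically unbiassed.


18.8 In Exercise 18.7, define the series
$$f(t) = 1 + t + \frac{n-1}{n+1}\,\frac{t^2}{2!} + \frac{(n-1)^2}{(n+1)(n+3)}\,\frac{t^3}{3!} + \cdots.$$

Show that the adjusted ML estimator
$$\tilde{\theta}_1 = \exp(\bar{x})\,f(\tfrac{1}{2}s^2)$$
is strictly unbiassed. Show further that $\hat{\theta}_1 > \tilde{\theta}_1$ for all samples, so that the bias of $\hat{\theta}_1$
over $\tilde{\theta}_1$ is uniform.


18.9 In Exercise 18.7, show that
$$\operatorname{var}\hat{\theta}_1 = E\{\exp(2\bar{x})\}\,E\{\exp(s^2)\} - \{E(\hat{\theta}_1)\}^2$$
$$\qquad = \exp(2\mu + \sigma^2/n)\left[\exp\{\sigma^2/n\}\left(1 - \frac{2\sigma^2}{n}\right)^{-\frac{1}{2}(n-1)} - \left(1 - \frac{\sigma^2}{n}\right)^{-(n-1)}\right]$$
exactly, with asymptotic variance
$$\operatorname{var}\hat{\theta}_1 \sim \exp(2\mu + \sigma^2)\cdot\frac{1}{n}(\sigma^2 + \tfrac{1}{2}\sigma^4),$$
and that this is also the asymptotic variance of $\tilde{\theta}_1$ in Exercise 18.8. Hence show that the
unbiassed moment-estimator of $\theta_1$,
$$\bar{y} = \frac{1}{n}\sum y,$$
has efficiency
$$(\sigma^2 + \tfrac{1}{2}\sigma^4)/\{\exp(\sigma^2) - 1\}.$$

(Exercises 18.7-9 are due to Finney (1941) and H. S. Sichel (1951-2))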


18.10 A multinomial distribution has $n$ classes, each of which has equal probability
$1/n$ of occurring. In a sample of $N$ observations, $k$ classes occur. Show that the LF
for the estimation of $n$ is
$$\frac{N!}{\prod_{i=1}^{k}(r_i!)}\left(\frac{1}{n}\right)^N\binom{n}{k}\frac{k!}{\prod_{j=1}^{N}(m_j!)},$$
where $r_i$ ($> 0$) is the number of observations in the $i$th class and $m_j$ is the number of
classes with $j$ ($\ge 1$) observations in the sample. Show that the ML estimator of $n$ is $\hat{n}$
where
$$\frac{N}{\hat{n}} = \sum_{j=\hat{n}-k+1}^{\hat{n}}\frac{1}{j},$$
and hence that approximately
$$\frac{N}{\hat{n}} = \log\left(\frac{\hat{n}}{\hat{n}-k+1}\right),$$
and that $k$ is sufficient for $n$. Show that for large $N$,
$$\operatorname{var}\hat{n} \sim \frac{n}{\exp\left(\dfrac{N}{n}\right) - \left(1 + \dfrac{N}{n}\right)}.$$
(Lewontin and Prout, 1956)


18.11 In Example 18.14, verify that the ML estimators (18.78) are jointly sufficient
for the five parameters of the distribution, and that the ML estimators (18.74-75) are
jointly sufficient for $\sigma_1^2$, $\sigma_2^2$ and $\rho$.


18.12 In estimating the correlation parameter $\rho$ of the bivariate normal population,
the other four parameters being known (Examples 18.14 and 18.15, case (a)), show that
the sample correlation coefficient (which is the ML estimator of $\rho$ if all five parameters
are being estimated; cf. (18.78)) has estimating efficiency $1/(1+\rho^2)$. Show further
that if the estimator
$$r = \frac{\frac{1}{n}\sum(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2}$$
is used, the efficiency drops even further, to
$$\left(\frac{1-\rho^2}{1+\rho^2}\right)^2. \qquad\text{(Stuart, 1955a)}$$


18.13 In Exercise 18.12, show that $(\sum x^2 + \sum y^2, \sum xy)$ is a pair of sufficient statistics
for the single parameter $\rho$, and that the ML estimator $\hat{\rho}$ is a function of this pair. When
$n = 1$, show that this sufficient statistic is a function of $(x, y)$, itself sufficient, but not
conversely.


18.14 In Examples 18.14 and 18.15, find the ML estimators of $\mu_1$ when all other
parameters are known, and of $\sigma_1^2$ similarly, and show that their large-sample variances
are respectively $\sigma_1^2(1-\rho^2)/n$ and
$$\frac{4\sigma_1^4(1-\rho^2)}{n(2-\rho^2)}.$$
Find the joint ML estimators of $\mu_1$ and $\sigma_1^2$ when the other three parameters are known,
and evaluate their large-sample dispersion matrix.


18.15 In Example 18.17, derive the results (18.109) for the uncorrelated estimators
$\hat{\alpha}_u$ and $\hat{\beta}_u$ measured about the centre of location.


18.16 Show that the centre of location of the Pearson Type IV distribution
$$dF \propto \exp\left\{-\nu\arctan\left(\frac{x-\alpha}{\sigma}\right)\right\}\left\{1 + \left(\frac{x-\alpha}{\sigma}\right)^2\right\}^{-\frac{1}{2}(p+2)}dx, \qquad -\infty < x < \infty,$$
where $\nu$ and $p$ are assumed known, is distant … to the left of the mode of the
distribution; that the variance of the ML estimator $\hat{\alpha}$ in large samples is
$$\frac{\sigma^2}{n}\,\frac{(p+4)^2 + \nu^2}{(p+1)(p+2)(p+4)},$$
and that the efficiency of the method of moments in locating the curve is therefore
$$\frac{p^2(p-1)\{(p+4)^2 + \nu^2\}}{(p+1)(p+2)(p+4)(p^2 + \nu^2)}. \qquad\text{(Fisher, 1921a)}$$


18.17 For $n$ independent observations from the distribution
$$dF = dx, \qquad \theta - \tfrac{1}{2} \le x \le \theta + \tfrac{1}{2},$$
show that $(x_{(1)}, x_{(n)})$ is sufficient for $\theta$, that no single sufficient statistic exists, and that
the LF is maximum at any value in the interval $(x_{(n)} - \tfrac{1}{2}, x_{(1)} + \tfrac{1}{2})$. Show that the
mid-point of this interval is an unbiassed estimator of $\theta$.


18.18 Members are drawn from an infinite population in which the proportion
bearing a given attribute is $\pi$, the drawing proceeding until $a$ members bearing that
attribute have appeared. The sample number then attained is $n$. Show that the
distribution of $n$ is given by
$$\binom{n-1}{a-1}\pi^a(1-\pi)^{n-a}, \qquad n = a, a+1, a+2, \ldots,$$
and that the ML estimator of $\pi$ is $a/n$. Show also that this is biassed and that its
asymptotic variance is $\pi^2(1-\pi)/a$.


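18.19 In the lognormal distribution of Exercises 18.7-8, consider the estimation of
the variance $\theta_2$. Show that the ML estimator
$$\hat{\theta}_2 = \exp(2\bar{x} + s^2)\{\exp(s^2) - 1\}$$
is biassed and that the adjusted ML estimator
$$\tilde{\theta}_2 = \exp(2\bar{x})\left\{f(2s^2) - f\left(\frac{n-2}{n-1}s^2\right)\right\}$$
is unbiassed.


18.20 In Exercise 18.19, show that asymptotically
$$\operatorname{var}\tilde{\theta}_2 \sim \frac{2\sigma^2}{n}\exp(4\mu + 2\sigma^2)\left[2\{\exp(\sigma^2) - 1\}^2 + \sigma^2\{2\exp(\sigma^2) - 1\}^2\right],$$
and hence that the efficiency of the unbiassed moment-estimator $s_y^2 = \dfrac{1}{n-1}\sum(y-\bar{y})^2$ is
$$\frac{2\sigma^2\left[2\{\exp(\sigma^2) - 1\}^2 + \sigma^2\{2\exp(\sigma^2) - 1\}^2\right]}{\{\exp(\sigma^2) - 1\}^2\{\exp(4\sigma^2) + 2\exp(3\sigma^2) + 3\exp(2\sigma^2) - 4\}}.$$
(Finney, 1941)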


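18.21 For a normal distribution with known variance $\sigma^2$ and unknown mean $\mu$
restricted to take integer values, show that the ML estimator $\hat{\mu}$ is the integer value nearest
to the sample mean $\bar{x}$, and that its sampling distribution is given by
$$f(\hat{\mu}) = \frac{1}{\sqrt{2\pi}}\int_{(\hat{\mu}-\mu-\frac{1}{2})\sqrt{n}/\sigma}^{(\hat{\mu}-\mu+\frac{1}{2})\sqrt{n}/\sigma} e^{-\frac{1}{2}t^2}\,dt.$$
Hence show that $\hat{\mu}$ is an unbiassed and consistent estimator of $\mu$ with asymptotic variance
$$\operatorname{var}\hat{\mu} \sim \left(\frac{8\sigma^2}{\pi n}\right)^{\frac{1}{2}}\exp\left(-\frac{n}{8\sigma^2}\right),$$
decreasing exponentially as $n$ increases. (Hammersley, 1950)


18.22 In the previous Exercise, define a statistic $T$ by $\tfrac{1}{2}T$ = the integer value
nearest to $\tfrac{1}{2}\bar{x}$. Show that
$$\operatorname{var} T < \operatorname{var}\hat{\mu}, \quad\text{when } \mu \text{ is even},$$
$$\operatorname{var} T \sim 1, \quad\text{when } \mu \text{ is odd},$$
and hence that $T$ is consistent or not according to whether $\mu$ is an even integer or not.
(Hammersley, 1950; discussion by C. Stein)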


18.23 $\log(t-\gamma)$ is normally distributed with mean $\mu$, variance $\sigma^2$ where $\gamma < t < \infty$,
and the three parameters $(\gamma, \mu, \sigma^2)$ are to be estimated. Defining
$\hat{\mu}(\gamma) = \dfrac{1}{n}\sum_{i=1}^{n}\log(t_i-\gamma)$, $\hat{\sigma}^2(\gamma) = \dfrac{1}{n}\sum_{i=1}^{n}\{\log(t_i-\gamma) - \hat{\mu}(\gamma)\}^2$, show that
$$L^*(\gamma) = \max_{\mu,\,\sigma^2} L(x \mid \gamma, \mu, \sigma^2) \propto \{\hat{\sigma}(\gamma)\}^{-n}\prod_{i=1}^{n}(t_i-\gamma)^{-1}$$
and that if $t_{(1)}$ is the smallest observed value of $t$,
$$\lim_{\gamma \to t_{(1)}} L^*(\gamma) = +\infty,$$
so that the ML estimator of $(\gamma, \mu, \sigma^2)$ for this three-parameter lognormal distribution
is always $(t_{(1)}, -\infty, +\infty)$.

(Hill, 1963a)


18.24 For a sample of $n$ observations from $f(x \mid \theta)$ grouped into intervals of width $h$,
write
$$f(x \mid \theta, h) = \int_{x-\frac{1}{2}h}^{x+\frac{1}{2}h} f(y \mid \theta)\,dy.$$

Show by a Taylor expansion that
$$f(x \mid \theta, h) = h f(x \mid \theta)\left\{1 + \frac{h^2}{24}\,\frac{\partial^2 f(x \mid \theta)/\partial x^2}{f(x \mid \theta)} + \cdots\right\}$$
and hence, to the first approximation, that the correction to be made to the ML estimator
$\hat{\theta}$ to allow for grouping is
$$\Delta = -\frac{h^2}{24}\,\frac{\displaystyle\sum_{i=1}^{n}\frac{\partial}{\partial\theta}\left(\frac{\partial^2 f/\partial x_i^2}{f}\right)}{\displaystyle\sum_{i=1}^{n}\frac{\partial^2 \log f}{\partial\theta^2}},$$
the value of the right-hand side being taken at $\hat{\theta}$.
(Lindley, 1950)


18.25 Using the last Exercise, show that for estimating the mean of a normal
population with known variance, $\Delta = 0$, while in estimating the variance with known mean,
$\Delta = -h^2/12$.

Each of these corrections is exactly the Sheppard grouping correction to the
corresponding population moment. To show that the ML grouping correction does not
generally coincide with the Sheppard correction, consider the distribution
$$dF = e^{-x/\theta}\,dx/\theta, \qquad \theta > 0;\ 0 \le x < \infty,$$
where $\hat{\theta} = \bar{x}$, the sample mean, and the correction to it is
$$\Delta = -\frac{h^2}{12\hat{\theta}},$$
whereas the Sheppard correction to the population mean is zero.
(Lindley, 1950)


18.26 Writing the negative binomial distribution (5.33) as
$$f_r = \left(1 + \frac{m}{k}\right)^{-k}\binom{k+r-1}{r}\left(\frac{m}{m+k}\right)^r, \qquad r = 0, 1, 2, \ldots,$$
show that for a sample of $n$ independent observations, with $n_r$ observations at the value
$r$ and $n_0 < n$, the ML estimator of $m$ is the sample mean
$$\hat{m} = \bar{r},$$
while that of $k$ is a root of
$$n\log\left(1 + \frac{\bar{r}}{k}\right) = \sum_{r=1}^{\infty} n_r\sum_{i=0}^{r-1}\frac{1}{k+i}.$$
Show that as $k$ decreases towards zero, the right-hand side of this equation exceeds the
left, and that if the sample variance $s^2$ exceeds $\bar{r}$ the left-hand side exceeds the right as
$k \to \infty$, and hence that the equation has at least one finite positive root. On the other
hand, if $s^2 < \bar{r}$, show that the two sides of the equation tend to equality as $k \to \infty$, so
that $\hat{k} = \infty$, and $f_r$ reduces to a Poisson distribution with parameter $m$.

(Anscombe, 1950)


18.27 In the previous Exercise, show that
$$\operatorname{var}\hat{m} = (m + m^2/k)/n,$$
$$\operatorname{var}\hat{k} \sim \frac{2k(k+1)}{n\left(\dfrac{m}{m+k}\right)^2}\Bigg/\left\{1 + 2\sum_{j=2}^{\infty}\frac{j}{j+1}\left(\frac{m}{m+k}\right)^{j-1}\Big/\binom{k+j}{j-1}\right\},$$
$$\operatorname{cov}(\hat{m}, \hat{k}) \sim 0.$$
(Anscombe (1950); Fisher (1941) investigated
the efficiency of the method of moments in
this case.)


18.28 For the Neyman Type A contagious distribution of Exercise 5.7, with
frequency function $f_r$, show that the ML estimators of $\lambda_1$, $\lambda_2$ are given by the roots of the
equations
$$\hat{\lambda}_1\hat{\lambda}_2 = \sum_r n_r(r+1)\hat{f}_{r+1}/(n\hat{f}_r) = \bar{r},$$
where $n_r$, $\bar{r}$ have the same meanings as in Exercise 18.26.

(Shenton (1949), who investigated the efficiency of the
method of moments in this case and found it to be
relatively low (< 70%) for small $\lambda_1$ (< 3) and large
$\lambda_2$ ($\ge 1$), and examined (1950, 1951) the efficiency
of using moments in the general case and in the
particular case of the Gram-Charlier Type A series.
See also Katti and Gurland (1962).)


18.29 For the "logistic" distribution
$$F(x) = \frac{1}{1 + \exp\{-(\alpha + \beta x)\}}, \qquad -\infty < x < \infty,$$
show that the ML estimators of $\alpha$ and $\beta$ are roots of the equations
$$\bar{x} = \frac{1}{\hat{\beta}} + \sum_{i=1}^{n}\frac{x_i\exp(-\hat{\beta}x_i)}{1 + \exp\{-(\hat{\alpha} + \hat{\beta}x_i)\}}\Bigg/\sum_{i=1}^{n}\frac{\exp(-\hat{\beta}x_i)}{1 + \exp\{-(\hat{\alpha} + \hat{\beta}x_i)\}},$$
$$\exp(\hat{\alpha}) = \frac{2}{n}\sum_{i=1}^{n}\frac{\exp(-\hat{\beta}x_i)}{1 + \exp\{-(\hat{\alpha} + \hat{\beta}x_i)\}}.$$
(Other estimators for this distribution are discussed by Gupta and Gnanadesikan (1966),
who use the methods of 19.18-20 (cf. Exercise 32.12 below) and of Blom (1958).)


18.30 Independent samples of sizes $n_1$, $n_2$ are taken from two normal populations
with equal means $\mu$ and variances respectively equal to $\lambda\sigma^2$, $\sigma^2$. Find the ML estimator
of $\mu$, and show that its large-sample variance is
$$\operatorname{var}(\hat{\mu}) = \frac{\lambda\sigma^2}{n_1 + \lambda n_2}.$$
Hence show that the unbiassed estimator
$$t = (n_1\bar{x}_1 + n_2\bar{x}_2)/(n_1 + n_2)$$
has efficiency
$$\frac{\lambda(n_1 + n_2)^2}{(n_1\lambda + n_2)(n_1 + n_2\lambda)},$$
which attains the value 1 if and only if $\lambda = 1$.


18.31 For a sample of $n$ observations from a normal distribution with mean $\theta$ and
variance $V(\theta)$, show that the ML estimator $\hat{\theta}$ is a root of
$$V' = 2(\bar{x} - \theta) + \frac{V'}{V}\,\frac{1}{n}\sum(x-\theta)^2,$$
and hence that if $V(\theta) = \sigma^2\theta^k$, where $\sigma^2$ is known, $\hat{\theta}$ is a function of both $\bar{x}$ and $\sum x^2$ unless
$k = 0$ (when $\hat{\theta} = \bar{x}$, the single sufficient statistic) or $k = 1$ (when $\hat{\theta}$ is a function of $\sum x^2$
only).


18.32 A parameter $\theta$ is estimated by $\theta^*$, a root of the equation $g(x, \theta) = 0$. If the
regularity conditions for the MVB in 17.14-15 hold, and $E\{g(x, \theta)\} = 0$, show that
$$\frac{E(g^2)}{\left\{E\left(\dfrac{\partial g}{\partial\theta}\right)\right\}^2} \ge \frac{1}{E\left\{\left(\dfrac{\partial\log L}{\partial\theta}\right)^2\right\}},$$
the equality holding only when $g(x, \theta)$ is a constant multiple of $\dfrac{\partial\log L}{\partial\theta}$, when $\theta^* = \hat{\theta}$,
the ML estimator. This generalization reduces to (17.22) when $g(x, \theta) = t - \tau(\theta)$.
(Godambe, 1960; Durbin, 1960)


18.33 For a random sample from the distribution
$$f(x \mid \theta) = \begin{cases}
\tfrac{1}{3}\exp\{-|x-\theta|\}, & x \le \theta, \\
\tfrac{1}{3}, & \theta \le x \le \theta + 1, \\
\tfrac{1}{3}\exp\{-(|x-\theta| - 1)\}, & \theta + 1 \le x,
\end{cases}$$
show that the ML estimator of $\theta$ is never unique. (Consider the cases $n = 1$, $n = 2$ in
particular.)

(Daniels, 1961)


18.34 A sample of $n$ observations is drawn from a normal population with mean $\mu$
and a variance which has equal probabilities of being 1 or $\sigma^2$. Show that as $n \to \infty$ no
ML estimator of $(\mu, \sigma^2)$ exists.

(Kiefer and Wolfowitz, 1956)


18.35 A sample of $n$ observations is taken from a continuous frequency function
$f(x)$, defined on the interval $0 \le x \le 1$, which is bounded by $0 \le f(x) \le 2$. Show that
an estimator $\hat{F}(x)$ of $F(x)$ is a ML estimator if and only if $\int_0^1 \hat{f}(x)\,dx = 1$,
$\hat{f}(x) = \dfrac{d}{dx}\hat{F}(x)$ is continuous and $\hat{f}(x_i) = 2$, $i = 1, 2, \ldots, n$. Hence show that
many inconsistent ML estimators, as well as consistent ML estimators, exist.

(Bahadur, 1958)


CHAPTER 19 
ESTIMATION : LEAST SQUARES AND OTHER METHODS 


19.1 In this chapter we shall examine in turn principles of estimation other than 
that of Maximum Likelihood (ML), to which Chapter 18 was devoted. The chief 
of these, the principle (or method) of Least Squares (abbreviated LS), while concep- 
tually quite distinct from the ML method and possessed of its own optimum properties, 
coincides with the ML method in the important case of normally distributed
observations. The other methods, to be discussed later in the chapter, are essentially
competitors to the ML method, and are equivalent to it, if at all, only in an asymptotic
sense. 


The method of Least Squares 
19.2 We have seen (Examples 18.2, 17.6) that the ML estimator of the mean $\mu$
in a sample of $n$ from a normal distribution

$$ dF(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\tfrac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^2\right\} dy \tag{19.1} $$

is obtained by maximizing the Likelihood Function

$$ \log L(y \mid \mu) = -\tfrac{1}{2} n \log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{j=1}^{n} (y_j-\mu)^2 \tag{19.2} $$

with respect to $\mu$. From inspection of (19.2) it is maximized when

$$ \sum_{j=1}^{n} (y_j-\mu)^2 \tag{19.3} $$

is minimized. The ML principle therefore tells us to choose $\hat\mu$ so that (19.3) is at
its minimum.

Now suppose that the population mean, $\mu$, is itself a linear function of parameters
$\theta_i$ ($i = 1, 2, \ldots, k$). We write

$$ \mu = \sum_{i=1}^{k} x_i\theta_i, \tag{19.4} $$

where the $x_i$ in (19.4) are not random variables but known constant coefficients
combining the unknown parameters $\theta_i$. If we now wish to estimate the $\theta_i$ individually,
we have, from (19.3) and (19.4), to minimize

$$ \sum_{j=1}^{n} \left(y_j - \sum_{i=1}^{k} x_i\theta_i\right)^2 \tag{19.5} $$

with respect to the $\theta_i$. We may now generalize a stage further: suppose that, instead
of the $n$ observations coming from identical normal distributions, the means of these
distributions differ. In fact, let

$$ \mu_j = \sum_{i=1}^{k} x_{ij}\theta_i, \quad j = 1, 2, \ldots, n. \tag{19.6} $$

We now have to minimize

$$ \sum_{j=1}^{n} \left(y_j - \sum_{i=1}^{k} x_{ij}\theta_i\right)^2 \tag{19.7} $$

with respect to the $\theta_i$.


19.3. The LS method gets its name from the minimization of a sum of squares 
as in (19.7). As a general principle, it states that if we wish to estimate the vector of
parameters $\theta$ in some expression $p(x, \theta) = 0$, where the symbol $x$ represents an
observation, we should choose our estimator $\hat\theta$ so that

$$ \sum_{i=1}^{n} \{p(x_i, \hat\theta)\}^2 $$

is minimized.

As with any other systematic principle of estimation, the acceptability of the LS 
method depends on the properties of the estimators to which it leads. Unlike the 
ML method, it has no general optimum properties to recommend it, even asymptotically. 
However, in an extremely important class of situation, it does have the optimum prop- 
erty, even in small samples, that it provides unbiassed estimators, linear in the 
observations, which have minimum variance (MV). This situation is usually described 
as the linear model, in which observations are distributed with constant variance about 
(possibly differing) mean values which are linear functions of the unknown parameters, 
and in which the observations are all uncorrelated in pairs. This is just the situation
we have postulated at (19.6) above, but we shall now abandon the normal distribution 
assumption which underlay the discussion of 19.2, since this is quite unnecessary to 
establish the optimum property of LS estimators. We now proceed to formalize the 
problem, and we shall find it convenient, as in Chapter 15, to use the notation and 
terminology of matrix and vector theory. 


The Least Squares estimator in the linear model 
19.4 We write the linear model in the form

$$ y = X\theta + \epsilon, \tag{19.8} $$

where $y$ is an $(n \times 1)$ vector of observations, $X$ is an $(n \times k)$ matrix of known coefficients
(with $n > k$), $\theta$ is a $(k \times 1)$ vector of parameters, and $\epsilon$ is an $(n \times 1)$ vector of "error"
random variables with

$$ E(\epsilon) = 0 \tag{19.9} $$

and dispersion matrix

$$ V(\epsilon) = E(\epsilon\epsilon') = \sigma^2 I \tag{19.10} $$

where $I$ is the $(n \times n)$ identity matrix. (19.9) and (19.10) thus embody the assumptions
that the $\epsilon_j$ are uncorrelated, and all have zero means and the same variance $\sigma^2$. These
assumptions are crucial to the results which follow. The linear model can be generalized
to a less restrictive situation (cf. 19.17 and Exercises 19.2 and 19.5), but the results
are correspondingly changed.
The reader should notice particularly that no restriction is placed upon the elements 
of X. By defining these suitably, we shall see in 33.43 below and in Chapter 35 
(Volume 3) that the present linear model may serve for the analysis of categorized and
classified observations. 
The LS method requires that we minimize the scalar sum of squares

$$ S = (y - X\theta)'(y - X\theta) \tag{19.11} $$

for variation in the components of $\theta$. A necessary condition that (19.11) be minimized
is that $\partial S/\partial\theta = 0$. Differentiating, we have $2X'(y - X\theta) = 0$, which gives for our
LS estimator the vector

$$ \hat\theta = (X'X)^{-1}X'y, \tag{19.12} $$

where we assume that $(X'X)$, the matrix of sums of squares and products of the
elements of the column-vectors composing $X$, is non-singular and can therefore be inverted.
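
The computation in (19.12) can be sketched in a few lines. The following is only an illustration under stated assumptions: the helper names (`transpose`, `matmul`, `inverse`, `least_squares`) and the four data points are invented for the purpose, and exact rational arithmetic is used so that the normal equations $X'(y - X\hat\theta) = 0$ can be checked without rounding error.

```python
from fractions import Fraction

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def inverse(a):
    """Gauss-Jordan inverse in exact rational arithmetic (hypothetical helper)."""
    n = len(a)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(a)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)  # fails if singular
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def least_squares(X, y):
    """theta_hat = (X'X)^{-1} X'y, as in (19.12); returns a column vector."""
    Xt = transpose(X)
    return matmul(inverse(matmul(Xt, X)), matmul(Xt, [[v] for v in y]))

# Invented illustration: a column of units and a second column 0, 1, 2, 3.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [1, 3, 4, 8]
theta_hat = least_squares(X, y)

# The residuals must satisfy the normal equations X'(y - X theta_hat) = 0.
residuals = [yj - sum(x * t[0] for x, t in zip(row, theta_hat))
             for row, yj in zip(X, y)]
normal_eq = matmul(transpose(X), [[r] for r in residuals])
print(theta_hat, normal_eq)
```

In practice a numerical library would solve the normal equations directly rather than form the inverse; the explicit inverse is kept here only because it mirrors (19.12).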


Example 19.1 
Consider the simplest case, where $\theta$ is a $(1 \times 1)$ vector, i.e. a single parameter.
We may then write (19.8) as

$$ y = x\theta + \epsilon $$

where $x$ is now an $(n \times 1)$ vector. The LS estimator is, from (19.12),

$$ \hat\theta = (x'x)^{-1}x'y = \sum_{j=1}^{n} x_j y_j \Big/ \sum_{j=1}^{n} x_j^2. $$


Example 19.2 
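Suppose now that $\theta$ has two components. The matrix $X$ now consists of two
vectors $x_1$, $x_2$. The model (19.8) becomes

$$ y = (x_1 \; x_2)\begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} + \epsilon, $$

and the LS estimator is now, from (19.12),

$$ \hat\theta = \begin{pmatrix} x_1'x_1 & x_1'x_2 \\ x_2'x_1 & x_2'x_2 \end{pmatrix}^{-1} \begin{pmatrix} x_1'y \\ x_2'y \end{pmatrix}
 = \begin{pmatrix} \sum x_1^2 & \sum x_1x_2 \\ \sum x_1x_2 & \sum x_2^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum x_1y \\ \sum x_2y \end{pmatrix}, $$

where all summations are over the suffix $j = 1, 2, \ldots, n$. Since $(X'X)$ is the matrix
of sums of squares and cross-products of the elements of the column-vectors of $X$,
and $X'y$ the vector of cross-products of the elements of $y$ with each of the $x$-vectors
in turn, the generalization of this example to a $\theta$ with more than two components
is obvious.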


Example 19.3 
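In Example 19.2 we specialize so that $x_1 = 1$, a vector of units. Hence,

$$ \hat\theta = \begin{pmatrix} n & \sum x_2 \\ \sum x_2 & \sum x_2^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum y \\ \sum x_2y \end{pmatrix}, $$

and we now invert the first matrix directly, obtaining

$$ \begin{pmatrix} n & \sum x_2 \\ \sum x_2 & \sum x_2^2 \end{pmatrix}^{-1}
 = \frac{1}{\{n\sum x_2^2 - (\sum x_2)^2\}} \begin{pmatrix} \sum x_2^2 & -\sum x_2 \\ -\sum x_2 & n \end{pmatrix}. $$

Multiplying this by $\begin{pmatrix} \sum y \\ \sum x_2y \end{pmatrix}$ we have

$$ \hat\theta = \frac{1}{\{n\sum x_2^2 - (\sum x_2)^2\}} \begin{pmatrix} \sum x_2^2 \sum y - \sum x_2 \sum x_2y \\ -\sum x_2 \sum y + n\sum x_2y \end{pmatrix}, $$

so that

$$ \hat\theta_1 = \frac{\sum x_2^2 \sum y - \sum x_2 \sum x_2y}{n\sum x_2^2 - (\sum x_2)^2}, \qquad
   \hat\theta_2 = \frac{n\sum x_2y - \sum x_2 \sum y}{n\sum x_2^2 - (\sum x_2)^2}. $$

Simplifying,

$$ \hat\theta_2 = \frac{\sum (x_2 - \bar{x}_2)(y - \bar{y})}{\sum (x_2 - \bar{x}_2)^2}, $$

and hence

$$ \hat\theta_1 = \bar{y} - \hat\theta_2\bar{x}_2. $$

It will be seen that $\hat\theta_2$ is exactly of the form of $\hat\theta$ in Example 19.1, with deviations
from means replacing those from origins. This is a general effect of the introduction
of a new parameter whose $x$-vector consists wholly of units (see Exercise 19.1).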


19.5 We may now establish the unbiassedness of the LS estimator (19.12). Using
(19.8), it may be written

$$ \hat\theta = (X'X)^{-1}X'(X\theta + \epsilon) = \theta + (X'X)^{-1}X'\epsilon. \tag{19.13} $$

Since $X$ is constant, we have, on using (19.9),

$$ E(\hat\theta) = \theta \tag{19.14} $$

as stated. The dispersion matrix of $\hat\theta$ is

$$ V(\hat\theta) = E\{(\hat\theta - \theta)(\hat\theta - \theta)'\}, $$

which, on substitution from (19.13), becomes

$$ V(\hat\theta) = E\{[(X'X)^{-1}X'\epsilon][(X'X)^{-1}X'\epsilon]'\} = (X'X)^{-1}X'E(\epsilon\epsilon')X(X'X)^{-1}. \tag{19.15} $$

Using (19.10), (19.15) becomes

$$ V(\hat\theta) = \sigma^2(X'X)^{-1}. \tag{19.16} $$

(19.12) and (19.16) make it clear that the computation of the vector of LS estimators
and of their dispersion matrix depends essentially upon the inversion of the matrix
of sums of squares and cross-products $(X'X)$. In the simpler cases (cf. Example 19.3)
this can be done algebraically. In more complicated cases, it will be necessary to
carry out the inversion by numerical methods, such as those described by L. Fox (1950)
and by L. Fox and Hayes (1951).
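Example 19.4
The variance of $\hat\theta$ in Example 19.1 is, from (19.16),

$$ \operatorname{var}(\hat\theta) = \sigma^2 / \sum x_j^2. $$

Example 19.5
In Example 19.2 we have

$$ (X'X)^{-1} = \begin{pmatrix} \sum x_1^2 & \sum x_1x_2 \\ \sum x_1x_2 & \sum x_2^2 \end{pmatrix}^{-1}
 = \frac{1}{\{\sum x_1^2 \sum x_2^2 - (\sum x_1x_2)^2\}} \begin{pmatrix} \sum x_2^2 & -\sum x_1x_2 \\ -\sum x_1x_2 & \sum x_1^2 \end{pmatrix}, $$

so that, from (19.16),

$$ \operatorname{var}(\hat\theta_1) = \frac{\sigma^2 \sum x_2^2}{\{\sum x_1^2 \sum x_2^2 - (\sum x_1x_2)^2\}}
 = \frac{\sigma^2}{\sum x_1^2 \left\{1 - \dfrac{(\sum x_1x_2)^2}{\sum x_1^2 \sum x_2^2}\right\}}, $$

and $\operatorname{var}(\hat\theta_2)$ is the same expression with suffixes 1 and 2 interchanged. The covariance
term is

$$ \operatorname{cov}(\hat\theta_1, \hat\theta_2) = \frac{-\sigma^2 \sum x_1x_2}{\{\sum x_1^2 \sum x_2^2 - (\sum x_1x_2)^2\}}, $$

and this is zero if, and only if, $\sum x_1x_2 = 0$.

Example 19.6
We found in Example 19.3 that

$$ (X'X)^{-1} = \frac{1}{\sum (x_2 - \bar{x}_2)^2} \begin{pmatrix} \sum x_2^2/n & -\bar{x}_2 \\ -\bar{x}_2 & 1 \end{pmatrix}. $$

From (19.16), we therefore have

$$ \operatorname{var}(\hat\theta_1) = \sigma^2 \sum x_2^2 / \{n\sum (x_2 - \bar{x}_2)^2\}, $$
$$ \operatorname{var}(\hat\theta_2) = \sigma^2 / \sum (x_2 - \bar{x}_2)^2, $$
$$ \operatorname{cov}(\hat\theta_1, \hat\theta_2) = -\sigma^2\bar{x}_2 / \sum (x_2 - \bar{x}_2)^2. $$

$\operatorname{var}(\hat\theta_2)$ is, as is to be expected, $\operatorname{var}(\hat\theta)$ in Example 19.4, with deviations from the
mean in the denominator. $\hat\theta_1$ and $\hat\theta_2$ are uncorrelated if and only if $\bar{x}_2 = 0$.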
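
The unbiassedness (19.14) and the dispersion matrix (19.16) can be checked exactly on a toy case. The sketch below assumes an invented design (a column of units and $x_2 = 0, 1, 2$), invented true parameter values, and errors taking the values $\pm 1$ with equal probability, so that (19.9) and (19.10) hold with $\sigma^2 = 1$; averaging over all eight error patterns then gives exact expectations. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def inverse(a):
    """Gauss-Jordan inverse in exact rational arithmetic (hypothetical helper)."""
    n = len(a)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(a)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

X = [[1, 0], [1, 1], [1, 2]]            # x_1 = units, x_2 = 0, 1, 2 (invented)
theta = [Fraction(2), Fraction(-1)]     # invented "true" parameters
A = matmul(inverse(matmul(transpose(X), X)), transpose(X))   # (X'X)^{-1} X'

estimates = []
for eps in product([-1, 1], repeat=len(X)):   # all equally likely error vectors
    y = [sum(x * t for x, t in zip(row, theta)) + e for row, e in zip(X, eps)]
    estimates.append([sum(a * v for a, v in zip(row, y)) for row in A])

m = len(estimates)
mean = [sum(est[i] for est in estimates) / m for i in range(2)]
cov = [[sum((est[i] - theta[i]) * (est[j] - theta[j]) for est in estimates) / m
        for j in range(2)] for i in range(2)]
print(mean, cov)
```

With $n = 3$, $\bar{x}_2 = 1$, $\sum x_2^2 = 5$ and $\sum (x_2 - \bar{x}_2)^2 = 2$, the formulae of Example 19.6 give variances $5/6$ and $1/2$ and covariance $-1/2$, which is what the enumeration returns.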


Optimum properties 
19.6 We now show that the MV unbiassed linear estimators of any set of linear
functions of the parameters $\theta_i$ are given by the LS method.(*) This may be very elegantly
demonstrated by a method used by Plackett (1949) in a discussion of the origins of
LS theory which makes it clear that the fundamental results are due to Gauss.
Let $t$ be any vector of estimators, linear in the observations $y$, i.e. of form

$$ t = Ty. \tag{19.17} $$

If $t$ is to be unbiassed for a set of linear functions of the parameters, say $C\theta$, we must
have $E(t) = E(Ty) = C\theta$ (where $C$ is a known matrix of coefficients), which on
using (19.8) gives

$$ E\{T(X\theta + \epsilon)\} = C\theta \tag{19.18} $$

or, using (19.9), since (19.18) must hold identically in $\theta$,

$$ TX = C. \tag{19.19} $$

(19.19) is a necessary and sufficient condition for $C\theta$ to be unbiassedly estimable by $Ty$.

(*) Barnard (1963) gives alternative optimum properties of the LS method.

The dispersion matrix of $t$ is

$$ V(t) = E\{(t - C\theta)(t - C\theta)'\} \tag{19.20} $$

and since, from (19.17), (19.8) and (19.19), $t - C\theta = T\epsilon$, (19.20) becomes

$$ V(t) = E(T\epsilon\epsilon'T') = \sigma^2 TT' \tag{19.21} $$

from (19.10). We wish to minimize the diagonal elements of (19.21), which are the
variances of our set of estimators.

Now the identity

$$ TT' = \{C(X'X)^{-1}X'\}\{C(X'X)^{-1}X'\}' + \{T - C(X'X)^{-1}X'\}\{T - C(X'X)^{-1}X'\}' \tag{19.22} $$

is easily verified by multiplying out its right-hand side, which is the sum of two terms,
each of form $AA'$. Each of these terms therefore has non-negative diagonal elements.
But only the second term on the right of (19.22) is a function of $T$. The sum of the
two terms therefore has strictly minimum diagonal elements when the second term
has all zero diagonal elements. This occurs when

$$ T = C(X'X)^{-1}X', \tag{19.23} $$

so that the MV unbiassed vector of estimators of $C\theta$ is, from (19.17) and (19.23),

$$ t = C(X'X)^{-1}X'y, \tag{19.24} $$

in which $\theta$ is simply replaced by its LS estimator (19.12), i.e.

$$ t = C\hat\theta, \tag{19.25} $$

and from (19.21) and (19.23)

$$ V(t) = \sigma^2 C(X'X)^{-1}C'. \tag{19.26} $$

If $C = I$, the identity matrix, so that we are estimating the vector $\theta$ itself, (19.24)
and (19.26) reduce to (19.12) and (19.16) respectively.

Although the LS estimators are MV unbiassed in the linear model among linear
functions of $y$, they are not so in general if non-linear functions of $y$ are admitted as
estimators; this depends on the distribution of the $\epsilon_j$. If the latter are normal (and
hence independent), the stronger property follows from the fact that the LS
estimators are functions of the $(k+1)$ minimal sufficient statistics $(\hat\theta, s^2)$ for the $(k+1)$
parameters $(\theta, \sigma^2)$ (see 17.39). T. W. Anderson (1962) shows that if all possible
distributions of the $\epsilon_j$ are considered, LS estimators are very rarely MV among all
estimators.
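
The identity (19.22) lends itself to a direct numerical check. In the sketch below, which rests on an invented $3 \times 2$ design with $C = I$, a competing linear unbiassed estimator is built as $T = T_{LS} + Z$, where the rows of the (invented) matrix $Z$ are orthogonal to the columns of $X$, so that (19.19) still holds. The helper names are hypothetical.

```python
from fractions import Fraction

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(a, b)]

def inverse(a):
    """Gauss-Jordan inverse in exact rational arithmetic (hypothetical helper)."""
    n = len(a)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(a)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

X = [[1, 0], [1, 1], [1, 2]]                       # invented design, C = I
T_ls = matmul(inverse(matmul(transpose(X), X)), transpose(X))   # (19.23)

# Rows of Z are multiples of (1, -2, 1), which is orthogonal to both columns of X.
Z = [[Fraction(1, 2), Fraction(-1), Fraction(1, 2)],
     [Fraction(-1), Fraction(2), Fraction(-1)]]
T = add(T_ls, Z)                                   # another unbiassed choice

TX = matmul(T, X)                                  # should equal C = I, as in (19.19)
lhs = matmul(T, transpose(T))                      # V(t) / sigma^2 for the competitor
ls_term = matmul(T_ls, transpose(T_ls))            # first term of (19.22)
rhs = add(ls_term, matmul(Z, transpose(Z)))        # both terms of (19.22)
print(TX, lhs, ls_term)
```

The diagonal of `lhs` exceeds that of `ls_term` by the diagonal of $ZZ'$, which is the excess variance incurred by departing from (19.23).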


19.7 It is instructive to display the MV property geometrically, following Durbin
and Kendall (1951). We shall here discuss only their simplest case, where we are
estimating a single parameter $\theta$ from a sample of $n$ observations, all with mean $\theta$ and
variance $\sigma^2$. Thus $y_j = \theta + \epsilon_j$, $j = 1, 2, \ldots, n$, which is (19.8) with $k = 1$ and $X$
an $(n \times 1)$ vector of units. We consider linear estimators

$$ t = \sum_j c_j y_j, \tag{19.27} $$

the simplest case of (19.17). The unbiassedness condition (19.19) here becomes

$$ \sum_j c_j = 1. \tag{19.28} $$

Consider an $n$-dimensional Euclidean space with one co-ordinate for each $c_j$. We call
this the estimator space. (19.28) is a hyperplane in this space, and any point $P$ in
the hyperplane corresponds uniquely to one unbiassed estimator. Now since the $y_j$
are uncorrelated, we have from (19.27)

$$ \operatorname{var} t = \sigma^2 \sum_j c_j^2 \tag{19.29} $$

so that the variance of $t$ is $\sigma^2\,OP^2$, where $O$ is the origin in the estimator space. It
follows at once that $t$ has MV when $P$ is the foot of the perpendicular from $O$ to the
hyperplane. By symmetry, we must then have every $c_j = 1/n$ and $t = \bar{y}$, the sample
mean.

Now consider the usual $n$-dimensional sample space, with one co-ordinate for each
$y_j$. The bilinear form (19.27) establishes a duality between this and the estimator
space. For any fixed $t$, a point in one space corresponds to a hyperplane in the other,
while for varying $t$ a point in one space corresponds to a family of parallel hyperplanes
in the other. To the hyperplane (19.28) in the estimator space there corresponds
the point $(t, t, \ldots, t)$ lying on the equiangular vector in the sample space. If a vector
through the origin is orthogonal to a hyperplane in one space, the corresponding
hyperplane and vector are orthogonal in the other space.

It now follows that the MV unbiassed estimator will be given in the sample space
by the hyperplane orthogonal to the equiangular vector at the point $L = (\bar{y}, \bar{y}, \ldots, \bar{y})$.
If $Q$ is the sample point, we drop a perpendicular from $Q$ on to the equiangular vector
to find $L$, i.e. we minimize $QL^2 = \sum_j (y_j - t)^2$. Thus we minimize a sum of squares
in the sample space and consequently minimize the variance (another sum of squares)
in the estimator space, as a result of the duality established between them.

W. H. Kruskal (1961) gives a completely geometrical approach to LS theory. 


19.8 A direct consequence of the result of 19.6 is that the LS estimator $\hat\theta$ minimizes
the value of the generalized variance for linear estimators of $\theta$. This result, which
is due to Aitken (1948), is exact, unlike the equivalent asymptotic result proved for
ML estimators in 18.28. We give Daniels' (1951-2) proof.

The result of 19.6, specialized to the estimation of a single linear function $c'\theta$,
where $c'$ is a $(1 \times k)$ vector, is that

$$ \operatorname{var}(c'\hat\theta) \leq \operatorname{var}(c'u), \tag{19.30} $$

where $\hat\theta$ is the LS estimator, and $u$ any other linear estimator, of $\theta$. We may rewrite
(19.30) as

$$ c'Vc \leq c'Dc, \tag{19.31} $$

where $V$ is the dispersion matrix of $\hat\theta$ and $D$ is the dispersion matrix of $u$.
Now we may make a real non-singular transformation

$$ c = Ab \tag{19.32} $$

which simultaneously diagonalizes $V$ and $D$. Using (19.32), (19.31) becomes

$$ b'(A'VA)b \leq b'(A'DA)b, \tag{19.33} $$

where the bracketed matrices are diagonal. By suitable choice of $b$, it is easily seen
that no element of $(A'VA)$ can exceed the corresponding element of $(A'DA)$.
Thus the determinant

$$ |A'VA| \leq |A'DA|, $$

i.e.,

$$ |A'|\,|V|\,|A| \leq |A'|\,|D|\,|A|, $$

or

$$ |V| \leq |D|, \tag{19.34} $$

the required result.


Estimation of variance 

19.9 The result of 19.6 is the first part of what is now usually called the Gauss
theorem on LS. The second part of the theorem is concerned with the estimation
of the variance, $\sigma^2$, from the observations.

Consider the set of "residuals" in LS estimation,

$$ y - X\hat\theta = [X\theta + \epsilon] - X[(X'X)^{-1}X'(X\theta + \epsilon)], \tag{19.35} $$

by (19.8) and (19.12). The terms in $\theta$ cancel, and (19.35) becomes

$$ y - X\hat\theta = \{I_n - X(X'X)^{-1}X'\}\epsilon, \tag{19.36} $$

where $I_n$ is the identity matrix of order $n$.

Now the matrix in braces on the right of (19.36) is symmetric and idempotent,
as may easily be verified by transposing and by squaring it. Thus the sum of squared
residuals is

$$ (y - X\hat\theta)'(y - X\hat\theta) = \epsilon'\{I_n - X(X'X)^{-1}X'\}\epsilon. \tag{19.37} $$

Now

$$ \epsilon'B\epsilon = \sum_i b_{ii}\epsilon_i^2 + \sum\sum_{i \neq j} b_{ij}\epsilon_i\epsilon_j. $$

Thus, from (19.10),

$$ E(\epsilon'B\epsilon) = \sigma^2 \operatorname{tr} B. \tag{19.38} $$

Applying (19.38), we have from (19.37)

$$ E\{(y - X\hat\theta)'(y - X\hat\theta)\} = \sigma^2 \operatorname{tr}\{I_n - X(X'X)^{-1}X'\}
 = \sigma^2[\operatorname{tr} I_n - \operatorname{tr}\{X(X'X)^{-1}X'\}], \tag{19.39} $$

and we may commute the matrices $X(X'X)^{-1}$ and $X'$ under the trace operator,
converting the product from an $(n \times n)$ to a $(k \times k)$ matrix. The right-hand side of (19.39)
becomes

$$ \sigma^2\{\operatorname{tr} I_n - \operatorname{tr} X'X(X'X)^{-1}\} = \sigma^2(\operatorname{tr} I_n - \operatorname{tr} I_k), $$

so that

$$ E\{(y - X\hat\theta)'(y - X\hat\theta)\} = \sigma^2(n - k). \tag{19.40} $$

Thus an unbiassed estimator of $\sigma^2$ is, from (19.40),

$$ s^2 = \frac{1}{n-k}(y - X\hat\theta)'(y - X\hat\theta), \tag{19.41} $$

the sum of squared residuals divided by the number of observations minus the number
of parameters estimated.
This result permits unbiassed estimators to be formed of the dispersion matrices
at (19.16) and (19.26), simply by putting $s^2$ of (19.41) for $\sigma^2$ in those expressions.
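
The properties used in this derivation can be verified exactly on a small case. The sketch below is only an illustration: the $4 \times 2$ design and the true parameter values are invented, the helper names are hypothetical, and the errors are taken as $\pm 1$ with equal probability (so that $\sigma^2 = 1$), which allows $E(s^2)$ to be computed by enumerating all sixteen error patterns.

```python
from fractions import Fraction
from itertools import product

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def inverse(a):
    """Gauss-Jordan inverse in exact rational arithmetic (hypothetical helper)."""
    n = len(a)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(a)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

X = [[1, 0], [1, 1], [1, 2], [1, 3]]     # invented design
n, k = len(X), len(X[0])
theta = [Fraction(3), Fraction(-2)]      # invented "true" parameters

P = matmul(inverse(matmul(transpose(X), X)), transpose(X))   # (X'X)^{-1} X'
H = matmul(X, P)                                             # X (X'X)^{-1} X'
M = [[int(i == j) - H[i][j] for j in range(n)] for i in range(n)]  # matrix of (19.36)
trace_M = sum(M[i][i] for i in range(n))

s2_values = []
for eps in product([-1, 1], repeat=n):   # all equally likely error vectors
    y = [sum(x * t for x, t in zip(row, theta)) + e for row, e in zip(X, eps)]
    theta_hat = [sum(p * v for p, v in zip(row, y)) for row in P]
    resid = [yj - sum(x * t for x, t in zip(row, theta_hat))
             for row, yj in zip(X, y)]
    s2_values.append(sum(r * r for r in resid) / (n - k))    # s^2 of (19.41)
mean_s2 = sum(s2_values) / len(s2_values)
print(trace_M, mean_s2)
```

The trace equals $n - k = 2$ and the average of $s^2$ equals the assumed $\sigma^2 = 1$, whatever values are chosen for `theta`.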


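Example 19.7
The unbiassed estimator of $\sigma^2$ in Examples 19.1 and 19.4 is, from (19.41),

$$ s^2 = \frac{1}{n-1}\sum_{j=1}^{n} (y_j - x_j\hat\theta)^2, $$

so that $\operatorname{var}(\hat\theta)$ is estimated unbiassedly by $s^2 / \sum x_j^2$.

Example 19.8
In Examples 19.2 and 19.5, the unbiassed estimator of $\sigma^2$ is, from (19.41),

$$ s^2 = \frac{1}{n-2}\sum_{j=1}^{n} (y_j - x_{1j}\hat\theta_1 - x_{2j}\hat\theta_2)^2, $$

where

$$ \begin{pmatrix} \hat\theta_1 \\ \hat\theta_2 \end{pmatrix}
 = \frac{1}{\{\sum x_1^2 \sum x_2^2 - (\sum x_1x_2)^2\}}
   \begin{pmatrix} \sum x_2^2 & -\sum x_1x_2 \\ -\sum x_1x_2 & \sum x_1^2 \end{pmatrix}
   \begin{pmatrix} \sum x_1y \\ \sum x_2y \end{pmatrix}
 = \frac{1}{\{\sum x_1^2 \sum x_2^2 - (\sum x_1x_2)^2\}}
   \begin{pmatrix} \sum x_2^2 \sum x_1y - \sum x_1x_2 \sum x_2y \\ \sum x_1^2 \sum x_2y - \sum x_1x_2 \sum x_1y \end{pmatrix}, $$

and we reduce this to the situation of Examples 19.3 and 19.6 by putting all $x_{1j} = 1$.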

The normality assumption 

19.10 All our results so far have assumed nothing concerning the distribution 
of the errors, $\epsilon_j$, except the conditions (19.9) and (19.10) concerning their first- and
second-order moments. It is rather remarkable that nothing need be assumed about 
the forms of the distributions of the errors: we make unbiassed estimators of the 
parameters and, further, unbiassed estimators of the sampling variances and covariances 
of these estimators, without distributional assumptions. However, if we wish to test 
hypotheses concerning the parameters, distributional assumptions are necessary. We 
shall be discussing the problems of testing hypotheses in the linear model in Chapter 24 ; 
here we shall only point out some fundamental features of the situation. 


19.11 If we postulate that the $\epsilon_j$ are normally distributed, the fact that they are
uncorrelated implies their independence (cf. 15.3), and we may use the result of 15.11
to the effect that an idempotent quadratic form in independent standardized normal
variates is a chi-squared variate with degrees of freedom given by the rank of the
quadratic form. Applying this to the sum of squared residuals (19.37), we have, in
the notation of (19.41), the result that $(n-k)s^2/\sigma^2$ is a chi-squared variate with $(n-k)$
degrees of freedom.

Further, we have the identity

$$ y'y = (y - X\hat\theta)'(y - X\hat\theta) + (X\hat\theta)'(X\hat\theta), \tag{19.42} $$

which is easily verified using (19.12). The second term on the right of (19.42) is

$$ \hat\theta'X'X\hat\theta = y'X(X'X)^{-1}X'y = (\epsilon' + \theta'X')X(X'X)^{-1}X'(X\theta + \epsilon). \tag{19.43} $$

From (19.43) it follows that if $\theta = 0$,

$$ \hat\theta'X'X\hat\theta = \epsilon'X(X'X)^{-1}X'\epsilon, \tag{19.44} $$

and (19.42) may then be rewritten, using (19.37) and (19.44),

$$ \epsilon'\epsilon = \epsilon'\{I_n - X(X'X)^{-1}X'\}\epsilon + \epsilon'\{X(X'X)^{-1}X'\}\epsilon. \tag{19.45} $$

We have already seen in 19.9 that the rank of the first matrix in braces on the
right of (19.45) is $(n-k)$, and we also established there that the rank of the second
matrix in braces in (19.45) is $k$. Thus the ranks on the right-hand side add to $n$,
the rank on the left, and Cochran's theorem (15.16) applies. Thus the two quadratic
forms on the right of (19.45) are independently distributed (after division by $\sigma^2$ in
each case to adjust the scale) like chi-squared with $(n-k)$ and $k$ degrees of freedom.


19.12 It will have been noticed that, in 19.11, the chi-squared distribution of
$(y - X\hat\theta)'(y - X\hat\theta)$ holds whatever the true value of $\theta$, while the second term in (19.42),
$(X\hat\theta)'(X\hat\theta)$, is only so distributed if the true value of $\theta$ is 0. Whether or not this
is so, we have from (19.43), using (19.9),

$$ E\{(X\hat\theta)'(X\hat\theta)\} = E\{\epsilon'X(X'X)^{-1}X'\epsilon\} + \theta'X'X\theta. \tag{19.46} $$

We saw in 19.9 that the first term on the right has the value $k\sigma^2$. Thus

$$ E\{(X\hat\theta)'(X\hat\theta)\} = k\sigma^2 + (X\theta)'(X\theta), \tag{19.47} $$

which exceeds $k\sigma^2$ unless $X\theta = 0$, which requires $\theta = 0$ unless $X$ takes special values.
Thus it is intuitively reasonable to use the ratio $(X\hat\theta)'(X\hat\theta)/(ks^2)$ (where $s^2$, defined
at (19.41), always has expected value $\sigma^2$) to test the hypothesis that $\theta = 0$. We shall
be returning to the justification of this and similar procedures from a less intuitive
point of view in Chapter 24.


The singular case 


19.13 In 19.4 we assumed $X'X$ to be non-singular, so that (19.12) was valid, and
$n > k$, so that (19.41) could be valid. If $n = k$, (19.12) still holds if $(X'X)^{-1}$ exists,
but (19.41) is useless since the sum of squared residuals is identically zero, as (19.37)
shows. If $n < k$, the rank of $X$ (and that of $X'X$, which is the same) is less than $k$,
so $X'X$ has no inverse.

We now let $X$ (and $X'X$) have rank $r < k$ and suppose that $n > r$. The LS estimation
problem must be discussed afresh, since $X'X$ has no inverse and (19.12) is invalid.
The treatment follows Plackett (1950).

The condition (19.19) is still necessary and sufficient for $C\theta$ to be unbiassedly
estimable by $Ty$. In this singular case, it cannot be satisfied if we wish to estimate
$\theta$ itself, when it becomes

$$ I = TX. \tag{19.48} $$

For, remembering that $X$ is of rank $r$, we partition it into

$$ X = \begin{pmatrix} X_{r,r} & X_{r,k-r} \\ X_{n-r,r} & X_{n-r,k-r} \end{pmatrix}, \tag{19.49} $$

the suffixes of the matrix elements of (19.49) indicating the numbers of rows and columns.
We assume, without loss of generality, that $X_{r,r}$ is non-singular, and therefore has
inverse $X_{r,r}^{-1}$. The last $n-r$ rows of $X$ are linearly dependent upon the first $r$ rows,
so that $X_{n-r,r} = CX_{r,r}$ and $X_{n-r,k-r} = CX_{r,k-r}$ for some $(n-r) \times r$ matrix $C$. Define
a new matrix, of order $k \times (k-r)$,

$$ D = \begin{pmatrix} X_{r,r}^{-1}X_{r,k-r} \\ -I_{k-r} \end{pmatrix}, \tag{19.50} $$

where $I_{k-r}$ is the identity matrix of that order. Evidently, $D$ is of rank $k-r$. If we
form the product $XD$, we see at once that

$$ XD = 0. \tag{19.51} $$

If we postmultiply (19.48) by $D$, we obtain

$$ D = TXD = 0 \tag{19.52} $$

from (19.51). This contradicts the fact that $D$ has rank $k-r$. Hence (19.48) cannot
hold.


19.14 In order to resolve this difficulty we must introduce a set of linear
constraints

$$ c = B\theta, \tag{19.53} $$

where $c$ is a $(k-r) \times 1$ vector of constants and $B$ is a $(k-r) \times k$ matrix, of rank $(k-r)$.
We now seek an estimator of the form $t = Ly + Nc$. The condition (19.48) now
becomes

$$ I = LX + NB, \tag{19.54} $$

and in order to avoid a contradiction, as at (19.52), we lay down the non-singularity
condition

$$ |BD| \neq 0. \tag{19.55} $$


19.15 We may now proceed to a LS solution. $B$, of rank $(k-r)$, makes up the
deficiency in rank of $X$. In fact, we treat $c$ as a vector of dummy random variables,
and solve (19.8) and (19.53) together, in the augmented model

$$ \begin{pmatrix} y \\ c \end{pmatrix} = \begin{pmatrix} X \\ B \end{pmatrix}\theta + \begin{pmatrix} \epsilon \\ 0 \end{pmatrix}. \tag{19.56} $$

The matrix $\begin{pmatrix} X \\ B \end{pmatrix}'\begin{pmatrix} X \\ B \end{pmatrix} = X'X + B'B$ is positive definite, for a non-null vector $d$ makes
$d'X'Xd = (Xd)'Xd$ equal to zero only if $Xd = 0$, whence $d$ must be a linear combination
of the columns of $D$. But (19.55) ensures that then $Bd \neq 0$, so that $d'B'Bd > 0$. Thus
$X'X + B'B$ is strictly positive definite and may be inverted.
(19.56) therefore yields, as at (19.12), the solution

$$ \hat\theta = (X'X + B'B)^{-1}(X'y + B'c), \tag{19.57} $$

which as before is the MV linear unbiassed estimator of $\theta$. Its dispersion matrix is,
since $c$ is not a random variable,

$$ V(\hat\theta) = \sigma^2(X'X + B'B)^{-1}X'X(X'X + B'B)^{-1}. \tag{19.58} $$

The matrix $B$ in (19.53) is arbitrary, subject to (19.55). In fact, if for $B$ we
substitute $UB$, where $U$ is any non-singular $(k-r) \times (k-r)$ matrix, (19.57) and (19.58)
are unaltered in value. Thus we may choose $B$ for convenience in computation in
any particular case.


19.16 Exercise 19.8 shows that $\sigma^2$ is estimated unbiassedly by the sum of squared
residuals divided by $(n-r)$ if $n > r$.

Chipman (1964) gives a detailed discussion of LS theory in the singular case. See 
also T. O. Lewis and Odell (1966). 


Example 19.9 
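As a simple example of a singular situation suppose that we have

$$ X = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}, \qquad
   \theta = \begin{pmatrix} \theta_1 \\ \theta_2 \\ \theta_3 \end{pmatrix}. $$

Here $n = 4$, $k = 3$ and $X$ has rank $2 < k$ because of the linear relation between its
column vectors

$$ x_1 - x_2 - x_3 = 0. $$

We first verify that $\theta$ cannot be unbiassedly estimated, as we saw in 19.13.
The matrix $D$ at (19.50) is of order $3 \times 1$, being

$$ D = \begin{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^{-1}\begin{pmatrix} 0 \\ 1 \end{pmatrix} \\ -1 \end{pmatrix}
     = \begin{pmatrix} 1 \\ -1 \\ -1 \end{pmatrix}, $$

expressing the linear relation. We now introduce the matrix of order $1 \times 3$

$$ B = (1 \;\; 0 \;\; 0), $$

which satisfies (19.55) since $BD = 1$, a scalar in this case of a single linear relation.
Hence (19.53) is

$$ c = (1 \;\; 0 \;\; 0)\begin{pmatrix} \theta_1 \\ \theta_2 \\ \theta_3 \end{pmatrix} = \theta_1, $$

again a scalar in this simple case. From (19.57), the LS estimator is

$$ \begin{pmatrix} \hat\theta_1 \\ \hat\theta_2 \\ \hat\theta_3 \end{pmatrix}
 = \begin{pmatrix} 5 & 2 & 2 \\ 2 & 2 & 0 \\ 2 & 0 & 2 \end{pmatrix}^{-1}
   \begin{pmatrix} y_1 + y_2 + y_3 + y_4 + c \\ y_1 + y_3 \\ y_2 + y_4 \end{pmatrix}
 = \begin{pmatrix} 1 & -1 & -1 \\ -1 & \tfrac{3}{2} & 1 \\ -1 & 1 & \tfrac{3}{2} \end{pmatrix}
   \begin{pmatrix} \sum y + c \\ y_1 + y_3 \\ y_2 + y_4 \end{pmatrix}
 = \begin{pmatrix} c \\ \tfrac{1}{2}(y_1 + y_3) - c \\ \tfrac{1}{2}(y_2 + y_4) - c \end{pmatrix}. $$

Since we chose $B$ so that $c = \theta_1$, we can make no further progress in estimating $\theta$ itself.
However, by (19.19), any set of linear functions $C\theta$ are unbiassedly estimated by $Ty$
if $TX = C$. Thus $(\theta_1 + \theta_2)$ and $(\theta_1 + \theta_3)$ are estimable, since
$C = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}$ satisfies (19.19) with
$T = \begin{pmatrix} \tfrac{1}{2} & 0 & \tfrac{1}{2} & 0 \\ 0 & \tfrac{1}{2} & 0 & \tfrac{1}{2} \end{pmatrix}$.
The estimator of $(\theta_1 + \theta_2)$ is therefore $\tfrac{1}{2}(y_1 + y_3)$ and
that of $(\theta_1 + \theta_3)$ is $\tfrac{1}{2}(y_2 + y_4)$.
From (19.58),

$$ V(\hat\theta) = \sigma^2
   \begin{pmatrix} 1 & -1 & -1 \\ -1 & \tfrac{3}{2} & 1 \\ -1 & 1 & \tfrac{3}{2} \end{pmatrix}
   \begin{pmatrix} 4 & 2 & 2 \\ 2 & 2 & 0 \\ 2 & 0 & 2 \end{pmatrix}
   \begin{pmatrix} 1 & -1 & -1 \\ -1 & \tfrac{3}{2} & 1 \\ -1 & 1 & \tfrac{3}{2} \end{pmatrix}
 = \sigma^2 \begin{pmatrix} 0 & 0 & 0 \\ 0 & \tfrac{1}{2} & 0 \\ 0 & 0 & \tfrac{1}{2} \end{pmatrix}, $$

so that

$$ \operatorname{var}(\hat\theta_1 + \hat\theta_2) = \operatorname{var}\hat\theta_2 = \operatorname{var}(\hat\theta_1 + \hat\theta_3) = \operatorname{var}\hat\theta_3 = \sigma^2/2, $$

as is evident from the fact that each estimator is a mean of two observations with
variance $\sigma^2$. Also

$$ \operatorname{cov}(\hat\theta_1 + \hat\theta_2, \hat\theta_1 + \hat\theta_3) = 0, $$

a useful property which is due to the orthogonality of the second and third columns
of $X$. When we come to discuss the application of LS theory to the Analysis of
Variance in Volume 3, we shall be returning to this subject.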
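
The arithmetic of this example is easily reproduced. The following sketch takes $X$, $B$ and $D$ as in Example 19.9; the observation vector and the two trial values of $c$ are invented for illustration, and the helper names are hypothetical. It evaluates (19.57) and (19.58) exactly and shows that the estimable functions $\theta_1 + \theta_2$ and $\theta_1 + \theta_3$ do not depend on the value chosen for $c$.

```python
from fractions import Fraction

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(a, b)]

def inverse(a):
    """Gauss-Jordan inverse in exact rational arithmetic (hypothetical helper)."""
    n = len(a)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(a)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

X = [[1, 1, 0], [1, 0, 1], [1, 1, 0], [1, 0, 1]]
B = [[1, 0, 0]]
D = [[1], [-1], [-1]]

XtX = matmul(transpose(X), X)
G = inverse(add(XtX, matmul(transpose(B), B)))       # (X'X + B'B)^{-1}

def estimate(y, c):
    """theta_hat of (19.57) for observations y and scalar constraint value c."""
    rhs = add(matmul(transpose(X), [[v] for v in y]), matmul(transpose(B), [[c]]))
    return [row[0] for row in matmul(G, rhs)]

y = [3, 5, 7, 1]                                      # invented observations
theta_a = estimate(y, 10)
theta_b = estimate(y, 0)
dispersion = matmul(G, matmul(XtX, G))                # V(theta_hat) / sigma^2, (19.58)
print(matmul(X, D), matmul(B, D), theta_a, theta_b, dispersion)
```

Changing $c$ shifts $\hat\theta_1$ and moves $\hat\theta_2$, $\hat\theta_3$ by the opposite amount, which is why only the sums are determined by the data.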


A more general linear model 


19.17 The LS theory which we have been developing assumes throughout that
(19.10) holds, i.e. that the errors are uncorrelated and have constant variance. There
is no difficulty in generalizing the linear model to the situation where the dispersion
matrix of errors is $\sigma^2 V$, $V$ being non-singular, and we find (the details are left to the
reader in Exercises 19.2 and 19.5) that (19.24) generalizes to

$$ t = C(X'V^{-1}X)^{-1}X'V^{-1}y, \tag{19.59} $$

and that this is the MV unbiassed linear estimator of $C\theta$. Further, (19.26) becomes

$$ V(t) = \sigma^2 C(X'V^{-1}X)^{-1}C'. \tag{19.60} $$

In particular, if $V$ is diagonal but not equal to $I$, so that the $\epsilon_j$ are uncorrelated but
with unequal variances, (19.59) provides the required set of estimators.

To use these equations, of course, we need to know $V$. In practical cases this is
often unknown and the estimation problem then becomes much more difficult.
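
As an illustration of (19.59) and (19.60) in the diagonal case, consider estimating a single common mean from three uncorrelated observations whose variances are $\sigma^2$, $2\sigma^2$ and $4\sigma^2$. The data, the matrix $V$ and the helper names in the sketch below are all invented for the purpose; the result is the familiar mean weighted inversely to the variances.

```python
from fractions import Fraction

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def inverse(a):
    """Gauss-Jordan inverse in exact rational arithmetic (hypothetical helper)."""
    n = len(a)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(a)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def generalized_ls(X, V, y, C):
    """Return t of (19.59) and V(t)/sigma^2 of (19.60)."""
    XtVi = matmul(transpose(X), inverse(V))          # X'V^{-1}
    G = inverse(matmul(XtVi, X))                     # (X'V^{-1}X)^{-1}
    t = matmul(C, matmul(G, matmul(XtVi, [[v] for v in y])))
    return t, matmul(C, matmul(G, transpose(C)))

X = [[1], [1], [1]]                                  # a single common mean
V = [[1, 0, 0], [0, 2, 0], [0, 0, 4]]                # invented, assumed known
y = [1, 2, 4]                                        # invented observations
C = [[1]]
t, disp = generalized_ls(X, V, y, C)

# For comparison: the unweighted mean has variance sigma^2 (1 + 2 + 4) / 9.
unweighted_var = Fraction(1 + 2 + 4, 9)
print(t, disp, unweighted_var)
```

The weighted estimator has variance $4\sigma^2/7$, against $7\sigma^2/9$ for the unweighted mean, as the MV property requires.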


Ordered Least Squares estimation of location and scale parameters 

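19.18 A particular situation in which (19.59) and (19.60) are of value is in the
estimation of location and scale parameters from the order-statistics, i.e. the sample
observations ordered according to magnitude. The results are due to Lloyd (1952)
and Downton (1953).

We denote the order-statistics, as previously (14.1), by $y_{(1)}, y_{(2)}, \ldots, y_{(n)}$. As
usual, we write $\mu$ and $\sigma$ for the location and scale parameters to be estimated (which
are not necessarily the mean and standard deviation of the distribution), and

$$ z_{(r)} = (y_{(r)} - \mu)/\sigma, \quad r = 1, 2, \ldots, n. \tag{19.61} $$

Let

$$ E(z) = \alpha, \qquad V(z) = V, \tag{19.62} $$

where $z$ is the $(n \times 1)$ vector of the $z_{(r)}$. Since $z$ has been standardized by (19.61),
$\alpha$ and $V$ are independent of $\mu$ and $\sigma$.

Now, from (19.61) and (19.62),

$$ E(y) = \mu 1 + \sigma\alpha, \tag{19.63} $$

where $y$ is the vector of $y_{(r)}$ and $1$ is a vector of units, while

$$ V(y) = \sigma^2 V. \tag{19.64} $$

We may now apply (19.59) and (19.60) to find the LS estimators of $\mu$ and $\sigma$. We
have

$$ \begin{pmatrix} \hat\mu \\ \hat\sigma \end{pmatrix} = \{(1 \,\vdots\, \alpha)'V^{-1}(1 \,\vdots\, \alpha)\}^{-1}(1 \,\vdots\, \alpha)'V^{-1}y \tag{19.65} $$

and

$$ V\begin{pmatrix} \hat\mu \\ \hat\sigma \end{pmatrix} = \sigma^2\{(1 \,\vdots\, \alpha)'V^{-1}(1 \,\vdots\, \alpha)\}^{-1}. \tag{19.66} $$

Now

$$ \{(1 \,\vdots\, \alpha)'V^{-1}(1 \,\vdots\, \alpha)\}^{-1}
 = \begin{pmatrix} 1'V^{-1}1 & 1'V^{-1}\alpha \\ 1'V^{-1}\alpha & \alpha'V^{-1}\alpha \end{pmatrix}^{-1}
 = \frac{1}{\Delta}\begin{pmatrix} \alpha'V^{-1}\alpha & -1'V^{-1}\alpha \\ -1'V^{-1}\alpha & 1'V^{-1}1 \end{pmatrix}, \tag{19.67} $$

where

$$ \Delta = \{(1'V^{-1}1)(\alpha'V^{-1}\alpha) - (1'V^{-1}\alpha)^2\}. \tag{19.68} $$

From (19.65) and (19.67),

$$ \hat\mu = -\alpha'V^{-1}(1\alpha' - \alpha 1')V^{-1}y/\Delta, \qquad
   \hat\sigma = 1'V^{-1}(1\alpha' - \alpha 1')V^{-1}y/\Delta. \tag{19.69} $$

From (19.66) and (19.67)

$$ \operatorname{var}\hat\mu = \sigma^2\alpha'V^{-1}\alpha/\Delta, \qquad
   \operatorname{var}\hat\sigma = \sigma^2 1'V^{-1}1/\Delta, \qquad
   \operatorname{cov}(\hat\mu, \hat\sigma) = -\sigma^2 1'V^{-1}\alpha/\Delta. \tag{19.70} $$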

19.19 Now since $V$ and $V^{-1}$ are positive definite, we may write

$$ V = TT', \qquad V^{-1} = (T^{-1})'T^{-1}, \tag{19.71} $$

so that for an arbitrary vector $b$

$$ b'Vb = b'TT'b = (T'b)'(T'b) = \sum_{i=1}^{n} h_i^2, $$

where $h_i$ is the $i$th row element of the vector $T'b$.
Similarly, for a vector $c$,

$$ c'V^{-1}c = (T^{-1}c)'(T^{-1}c) = \sum_{i=1}^{n} k_i^2, $$

$k_i$ being the element of $T^{-1}c$. Now by the Cauchy inequality,

$$ \sum h_i^2 \sum k_i^2 = b'Vb \cdot c'V^{-1}c \geq \left(\sum h_i k_i\right)^2 = \{(T'b)'(T^{-1}c)\}^2 = (b'c)^2. \tag{19.72} $$

In (19.72), put

$$ b = (V^{-1} - I)1, \qquad c = \alpha. \tag{19.73} $$

We obtain

$$ 1'(V^{-1} - I)V(V^{-1} - I)1 \cdot \alpha'V^{-1}\alpha \geq \{1'(V^{-1} - I)\alpha\}^2. \tag{19.74} $$

It is easily verified that

$$ 1'V1 = n, \qquad 1'\alpha = 0. \tag{19.75} $$

Since $1'1 = n$, the first factor on the left of (19.74) is $1'V^{-1}1 - 2n + 1'V1$.
Using (19.75) in (19.74), it becomes

$$ (1'V^{-1}1 - n)\,\alpha'V^{-1}\alpha \geq (1'V^{-1}\alpha)^2, $$

which we may rewrite, using (19.70) and (19.68), and now interpreting $\sigma^2$ as the variance,

$$ \operatorname{var}\hat\mu \leq \sigma^2/n = \operatorname{var}\bar{y}. \tag{19.76} $$

(19.76) is obvious enough, since $\bar{y}$, the sample mean, is a linear estimator and therefore
cannot have variance less than the MV estimator $\hat\mu$. But the point of the above
argument is that it enables us to determine when (19.76) becomes a strict equality. This
happens when the Cauchy inequality (19.72) becomes an equality, i.e. when $h_i = \lambda k_i$
for some constant $\lambda$, or

$$ T'b = \lambda T^{-1}c. $$

From (19.73) this is, in our case, the condition

$$ T'(V^{-1} - I)1 = \lambda T^{-1}\alpha, $$

or

$$ TT'(V^{-1} - I)1 = \lambda\alpha. \tag{19.77} $$

Using (19.71), (19.77) finally becomes

$$ (I - V)1 = \lambda\alpha, \tag{19.78} $$

the condition that $\operatorname{var}\hat\mu = \operatorname{var}\bar{y} = \sigma^2/n$. If (19.78) holds, we must also have, by the
uniqueness of the LS solution,

$$ \hat\mu = \bar{y}, \tag{19.79} $$

and this may be verified by using (19.78) on $\hat\mu$ in (19.69).


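19.20 If the parent distribution is symmetrical, the situation simplifies. For then
the vector of expectations

$$ \alpha = \begin{pmatrix} E(z_{(1)}) \\ \vdots \\ E(z_{(n)}) \end{pmatrix} = \begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_n \end{pmatrix} $$

has

$$ \alpha_r = -\alpha_{n-r+1}, \quad \text{all } r, \tag{19.80} $$

as follows immediately from (14.2). Hence

$$ \alpha'V^{-1}1 = 1'V^{-1}\alpha = 0 \tag{19.81} $$

and thus (19.69) becomes

$$ \hat\mu = 1'V^{-1}y / 1'V^{-1}1, \qquad \hat\sigma = \alpha'V^{-1}y / \alpha'V^{-1}\alpha, \tag{19.82} $$

while (19.70) simplifies to

$$ \operatorname{var}\hat\mu = \sigma^2 / 1'V^{-1}1, \qquad
   \operatorname{var}\hat\sigma = \sigma^2 / \alpha'V^{-1}\alpha, \qquad
   \operatorname{cov}(\hat\mu, \hat\sigma) = 0. \tag{19.83} $$

Thus the ordered LS estimators $\hat\mu$ and $\hat\sigma$ are uncorrelated if the parent distribution
is symmetrical, an analogous result to that for ML estimators obtained in 18.34.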


Example 19.10 
To estimate the midrange (or mean) $\mu$ and range $\sigma$ of the rectangular distribution

$$ dF(x) = dx/\sigma, \quad \mu - \tfrac{1}{2}\sigma \leq x \leq \mu + \tfrac{1}{2}\sigma. $$

Using (14.2), it is easy to show that, standardizing as in (19.61),

$$ \alpha_r = E(z_{(r)}) = \{r/(n+1)\} - \tfrac{1}{2}, \tag{19.84} $$

and that the elements of the dispersion matrix $V$ of the $z_{(r)}$ are

$$ V_{rs} = r(n-s+1)/\{(n+1)^2(n+2)\}, \quad r \leq s. \tag{19.85} $$

The inverse of $V$ is

$$ V^{-1} = (n+1)(n+2)\begin{pmatrix}
 2 & -1 & & & \\
 -1 & 2 & -1 & & 0 \\
 & \ddots & \ddots & \ddots & \\
 0 & & -1 & 2 & -1 \\
 & & & -1 & 2
\end{pmatrix}. \tag{19.86} $$

From (19.86),

$$ 1'V^{-1} = (n+1)(n+2)\,(1, 0, 0, \ldots, 0, 0, 1) \tag{19.87} $$

and, from (19.84) and (19.86),

$$ \alpha'V^{-1} = \tfrac{1}{2}(n+1)(n+2)\,(-1, 0, 0, \ldots, 0, 0, 1). \tag{19.88} $$

Using (19.87) and (19.88), (19.82) and (19.83) give

$$ \hat\mu = \tfrac{1}{2}(y_{(1)} + y_{(n)}), $$
$$ \hat\sigma = (n+1)(y_{(n)} - y_{(1)})/(n-1), $$
$$ \operatorname{var}\hat\mu = \sigma^2/\{2(n+1)(n+2)\}, \tag{19.89} $$
$$ \operatorname{var}\hat\sigma = 2\sigma^2/\{(n-1)(n+2)\}, $$
$$ \operatorname{cov}(\hat\mu, \hat\sigma) = 0. $$

Apart from the bias correction to $\hat\sigma$, these are essentially the results we obtained by
the ML method in Example 18.12. The agreement is to be expected, since $y_{(1)}$ and
$y_{(n)}$ are a pair of jointly sufficient statistics for $\mu$ and $\sigma$, as we saw in effect in Example
dake 
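
The results of Example 19.10 can be checked by applying (19.65) and (19.66) directly, without using the explicit inverse (19.86). The sketch below takes $n = 5$ (an arbitrary choice), builds $\alpha$ and $V$ from (19.84) and (19.85), and computes the coefficient vectors applied to the order-statistics; the helper names are hypothetical.

```python
from fractions import Fraction

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def inverse(a):
    """Gauss-Jordan inverse in exact rational arithmetic (hypothetical helper)."""
    n = len(a)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(a)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

n = 5   # arbitrary sample size for the check

# (19.84): expectations of the standardized order-statistics.
alpha = [Fraction(r, n + 1) - Fraction(1, 2) for r in range(1, n + 1)]

# (19.85): V_rs = r(n - s + 1) / {(n+1)^2 (n+2)} for r <= s, filled symmetrically.
V = [[Fraction(min(r, s) * (n - max(r, s) + 1), (n + 1) ** 2 * (n + 2))
      for s in range(1, n + 1)] for r in range(1, n + 1)]

A = [[1, a] for a in alpha]              # the matrix (1 : alpha)
V_inv = inverse(V)
AtVi = matmul(transpose(A), V_inv)
G = inverse(matmul(AtVi, A))             # dispersion matrix of (19.66), over sigma^2
W = matmul(G, AtVi)                      # rows: coefficients of y_(r) in mu_hat, sigma_hat
print(W[0], W[1], G)
```

Only the extreme order-statistics receive non-zero weight, and the diagonal of `G` reproduces $1/\{2(n+1)(n+2)\}$ and $2/\{(n-1)(n+2)\}$ of (19.89) for $n = 5$.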


19.21 As will have been made clear by Example 19.10, in order to use the theory 
in 19.18-20, we must determine the dispersion matrix V of the standardized order- 
statistics, and this is a function of the form of the parent distribution. This is in direct 
contrast with the general LS theory using unordered observations, discussed earlier 
in this chapter, which does not presuppose knowledge of the parent form. In Chapter 32
we shall return to the properties of order-statistics in the estimation of parameters. 

The general LS theory developed in this chapter is fundamental in many branches 
of statistical theory, and we shall use it repeatedly in later chapters. 


Other methods of estimation 
19.22 We saw in the preceding chapter that, apart from the fact that they are
functions of sufficient statistics for the parameters being estimated, the desirable
properties of the ML estimators are all asymptotic ones, namely:

(i) consistency;

(ii) asymptotic normality; and

(iii) efficiency.

Evidently, the ML estimator, $\hat\theta$, cannot be unique in the possession of these properties.
For example, the addition to $\hat\theta$ of an arbitrary constant $C/n^r$ will make no difference
to its first-order properties if $r$ is large enough. It is thus natural to inquire, as Neyman
(1949) did, concerning the class of estimators which share the asymptotic properties
of $\hat\theta$. Added interest is lent to the inquiry by the numerical tedium sometimes involved
(cf. Examples 18.3, 18.9, 18.10) in evaluating the ML estimator.


19.23 Suppose that we have $s\,(\ge 1)$ samples, with $n_i$ observations in the $i$th sample.
As at 18.19, we simplify the problem by supposing that each observation in the $i$th
sample is classified into one of $k_i$ mutually exclusive and exhaustive classes. If $\pi_{ij}$ is
the probability of an observation in the $i$th sample falling into the $j$th class, we therefore
have
$$\sum_{j=1}^{k_i} \pi_{ij} = 1, \tag{19.90}$$
and we have reduced the problem to one concerning a set of $s$ multinomial distributions.
Let $n_{ij}$ be the number of $i$th sample observations actually falling into the $j$th class,
and $p_{ij} = n_{ij}/n_i$ the corresponding relative frequency. The probabilities $\pi_{ij}$ are
functions of a set of unknown parameters $(\theta_1, \ldots, \theta_r)$.

A function $T$ of the random variables $p_{ij}$ is called a Best Asymptotically Normal
estimator (abbreviated BAN estimator) of $\theta_1$, one of the unknown parameters, if

(i) $T(\{p_{ij}\})$ is consistent for $\theta_1$;

(ii) $T$ is asymptotically normal as $N = \sum_{i=1}^{s} n_i \to \infty$;

(iii) $T$ is efficient; and

(iv) $\partial T/\partial p_{ij}$ exists and is continuous in $p_{ij}$ for all $i, j$.

The first three of these conditions are precisely those we have already proved for
the ML estimator in Chapter 18. It is easily verified that the ML estimator also
possesses the fourth property in this multinomial situation. Thus the class of BAN
estimators contains the ML estimator as a special case.


19.24 Neyman showed that a set of necessary and sufficient conditions for an
estimator to be BAN is that

(i) $T(\{\pi_{ij}\}) \equiv \theta_1$;

(ii) condition (iv) of 19.23 holds; and

(iii) $\displaystyle \sum_{i=1}^{s} \frac{1}{n_i} \sum_{j=1}^{k_i} \left[\left(\frac{\partial T}{\partial p_{ij}}\right)_{p_{ij}=\pi_{ij}}\right]^2 \pi_{ij}$ be minimized for variation in $\partial T/\partial p_{ij}$.

Condition (i) is enough to ensure consistency: it is, in general, a stronger condition
than consistency.(*) In this case, since the statistic $T$ is a continuous function of the
$p_{ij}$, and the $p_{ij}$ converge in probability to the $\pi_{ij}$, $T$ converges in probability to
$T(\{\pi_{ij}\})$, i.e. to $\theta_1$.

Condition (iii) is simply the efficiency condition, for the function there to be
minimized is simply the variance of $T$ subject to the necessary condition for a
minimum
$$\sum_{j} \left(\frac{\partial T}{\partial p_{ij}}\right)_{p_{ij}=\pi_{ij}} \pi_{ij} = 0.$$

As they stand, these three conditions are not of much practical value. However,
Neyman also showed that a sufficient set of conditions is obtainable by replacing (iii)
by a direct condition on $\partial T/\partial p_{ij}$, which we shall not give here. From this he deduced

(a) that the ML estimator is a BAN estimator, as we have already seen;

(b) that the class of estimators known as Minimum Chi-Square estimators are also
BAN estimators.

We now proceed to examine this second class of estimators.


(*) In fact, (i) is the form in which consistency was originally defined by Fisher (1921a).


Minimum Chi-Square estimators
19.25 Referring to the situation described in 19.23, a statistic $T$ is called a Minimum
Chi-Square (abbreviated MCS) estimator of $\theta_1$, if it is obtained by minimizing,
with respect to $\theta_1$, the expression
$$\chi^2 = \sum_{i=1}^{s} n_i \sum_{j=1}^{k_i} \frac{(p_{ij}-\pi_{ij})^2}{\pi_{ij}}
 = \sum_{i=1}^{s} n_i \left( \sum_{j=1}^{k_i} \frac{p_{ij}^2}{\pi_{ij}} - 1 \right), \tag{19.91}$$
where the $\pi_{ij}$ are functions of $\theta_1, \ldots, \theta_r$. To minimize (19.91), we put
$$\frac{\partial \chi^2}{\partial \theta_1} = -\sum_{i=1}^{s} n_i \sum_{j=1}^{k_i} \left(\frac{p_{ij}}{\pi_{ij}}\right)^2 \frac{\partial \pi_{ij}}{\partial \theta_1} = 0, \tag{19.92}$$
and a root of (19.92), regarded as an equation in $\theta_1$, is the MCS estimator of $\theta_1$.
Evidently, we may generalize (19.92) to a set of $r$ equations to be solved together to find
the MCS estimators of $\theta_1, \ldots, \theta_r$.

The procedure for finding MCS estimators is quite analogous to that for finding 
ML estimators, discussed in Chapter 18. Moreover, the (asymptotic) properties of 
MCS estimators are similar to those of ML estimators. In fact, there is, with prob- 
ability 1, a unique consistent root of the MCS equations, and this corresponds to the 
absolute minimum (infimum) value of (19.91). The proofs are given, for the com- 
monest case s = 1, by C. R. Rao (1957). 


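19.26 A modified form of MCS estimator is obtained by minimizing
$$\chi'^2 = \sum_{i=1}^{s} n_i \sum_{j=1}^{k_i} \frac{(p_{ij}-\pi_{ij})^2}{p_{ij}}
 = \sum_{i=1}^{s} n_i \left( \sum_{j=1}^{k_i} \frac{\pi_{ij}^2}{p_{ij}} - 1 \right) \tag{19.93}$$
instead of (19.91). In (19.93), we assume that no $p_{ij} = 0$. To minimize it for variation
in $\theta_1$, we put
$$\frac{\partial \chi'^2}{\partial \theta_1} = 2 \sum_{i=1}^{s} n_i \sum_{j=1}^{k_i} \left(\frac{\pi_{ij}}{p_{ij}}\right) \frac{\partial \pi_{ij}}{\partial \theta_1} = 0 \tag{19.94}$$
and solve for the estimator of $\theta_1$. These modified MCS estimators have also been
shown to be BAN estimators by Neyman (1949).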
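
The two minimizations may be sketched numerically for a single sample (s = 1). The model below, with class probabilities theta^2, 2 theta (1 - theta) and (1 - theta)^2, and the counts used, are purely illustrative assumptions and are not taken from the text; the golden-section search is one convenient minimizer, not the method of the source. For this hypothetical model the ML estimate is (2 n_1 + n_2)/(2n), which gives a value for comparison.

```python
def cell_probs(theta):
    """Hypothetical one-parameter trinomial model (illustration only)."""
    return [theta ** 2, 2 * theta * (1 - theta), (1 - theta) ** 2]

def chi_square(theta, counts):
    """Expression (19.91) with s = 1: n * sum (p_j - pi_j)^2 / pi_j."""
    n = sum(counts)
    return n * sum((c / n - pi) ** 2 / pi
                   for c, pi in zip(counts, cell_probs(theta)))

def modified_chi_square(theta, counts):
    """Expression (19.93) with s = 1: n * sum (p_j - pi_j)^2 / p_j.
    Every observed count must be positive."""
    n = sum(counts)
    return n * sum((c / n - pi) ** 2 / (c / n)
                   for c, pi in zip(counts, cell_probs(theta)))

def golden_section_min(f, lo, hi, tol=1e-10):
    """Minimize a unimodal function on [lo, hi] by golden-section search."""
    ratio = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if f(c) < f(d):
            b = d          # minimum lies in [a, d]
        else:
            a = c          # minimum lies in [c, b]
    return (a + b) / 2

counts = [30, 50, 20]                      # illustrative data, not from the text
n = sum(counts)
ml = (2 * counts[0] + counts[1]) / (2 * n)  # ML estimate for this model
mcs = golden_section_min(lambda t: chi_square(t, counts), 0.01, 0.99)
mod_mcs = golden_section_min(lambda t: modified_chi_square(t, counts), 0.01, 0.99)
print(ml, mcs, mod_mcs)
```

With data that fit the model well, the three estimates are close to one another, as the common asymptotic properties would lead one to expect; they need not coincide exactly.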


19.27. Since the ML, the MCS and the modified MCS methods all have the same 
asymptotic properties, the choice between them must rest, in any particular case, 
either on the grounds of computational convenience, or on those of superior sampling 
properties in small samples, or on both. As to the first ground, there is little that can 
be said in general. Sometimes the ML, and sometimes the MCS, equation is the more 
difficult to solve. But when dealing with a continuous distribution, the observations 
must be grouped in order to make use of the MCS method, and it seems rather wasteful 
to impose an otherwise unnecessary grouping for estimation purposes. Furthermore, 
there is, especially for continuous distributions, preliminary inconvenience in having 
to determine the $\pi_{ij}$ in terms of the parameters to be estimated. Our own view is
therefore that the now traditional leaning towards ML estimation is fairly generally 
justifiable on computational grounds. The following example illustrates the MCS 
computational procedure in a relatively simple case. 


Example 19.11 
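Consider the estimation, from a single sample of $n$ observations, of the parameter $\theta$
of a Poisson distribution. We have seen (Examples 17.8, 17.15) that the sample mean $\bar x$
is a MVB sufficient estimator of $\theta$, and it follows from 18.5 that $\bar x$ is also the ML
estimator.
The MCS estimator of $\theta$, however, is not equal to $\bar x$, illustrating the point that
MCS methods do not necessarily yield a single sufficient statistic if one exists.
The theoretical probabilities here are
$$\pi_j = e^{-\theta}\theta^j/j!, \qquad j = 0, 1, 2, \ldots,$$
so that
$$\frac{\partial \pi_j}{\partial \theta} = \pi_j \left(\frac{j}{\theta} - 1\right).$$
The minimizing equation (19.92) is therefore, dropping the factor $1/n$,
$$-\sum_j \left(\frac{n_j}{\pi_j}\right)^2 \pi_j \left(\frac{j}{\theta} - 1\right)
 = \sum_j \frac{n_j^2}{\pi_j}\left(1 - \frac{j}{\theta}\right) = 0, \tag{19.95}$$
where $n_j$ is the number of observations equal to $j$.

This is the equation to be solved for $\theta$, and we use an iterative method of solution
similar to that used for the ML estimator at 18.21. We expand the left-hand side of
(19.95) in a Taylor series as a function of $\theta$ about the sample mean $\bar x$, regarded as a
trial value. We obtain to the first order of approximation
$$\sum_j \frac{n_j^2}{\pi_j}\left(1 - \frac{j}{\theta}\right)
 = \sum_j \frac{n_j^2}{m_j}\left(1 - \frac{j}{\bar x}\right)
 + (\theta - \bar x) \sum_j \frac{n_j^2}{m_j}\left\{\frac{j}{\bar x^2} + \left(1 - \frac{j}{\bar x}\right)^2\right\}, \tag{19.96}$$
where we have written $m_j = e^{-\bar x}\bar x^j/j!$. If (19.96) is equated to zero, by (19.95), we
find
$$(\theta - \bar x) = \bar x\, \frac{\sum_j \dfrac{n_j^2}{m_j}(j - \bar x)}{\sum_j \dfrac{n_j^2}{m_j}\{j + (j - \bar x)^2\}}. \tag{19.97}$$
We use (19.97) to find an improved estimate of $\theta$ from $\bar x$, and repeat the process as
necessary.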

As a numerical example, we use Whitaker’s (1914) data on the number of deaths 
of women over 85 years old reported in The Times newspaper for each day of 1910- 
1912, 1,096 days in all. The distribution is given in the first two columns of the 
table on page 95. 

The mean number of deaths reported is found to be $\bar x = 1295/1096 = 1.181569$.
This is therefore the ML estimator, and we use it as our first trial value for the MCS
estimator. The third column of the table gives the expected frequencies in a Poisson
distribution with parameter equal to $\bar x$, and the necessary calculations for (19.97)
are set out in the remaining five columns.

Thus, from (19.97), we have
$$\theta = 1.1816\left\{1 + \frac{42.2}{3242.4}\right\} = 1.197$$
as our improved value. K. Smith (1916) reported a value of 1.196903 when working to
greater accuracy, with more than one iteration of this procedure.

No. of     Frequency   $nm_j$    $\dfrac{n_j^2}{nm_j}$   $(j-\bar x)$   $\dfrac{n_j^2}{nm_j}(j-\bar x)$   $\{j+(j-\bar x)^2\}$   $\dfrac{n_j^2}{nm_j}\{j+(j-\bar x)^2\}$
deaths $j$   $n_j$

   0          364      336.25     394.1       -1.1816       -465.7         1.396         551.1
   1          376      397.30     355.8       -0.1816        -64.6         1.033         365.9
   2          218      234.72     202.5        0.8184        165.8         2.670         540.6
   3           89       92.45      85.69       1.8184        155.8         6.307         540.4
   4           33       27.31      39.87       2.8184        112.4        11.943         476.1
   5           13        6.45      26.20       3.8184        100.0        19.580         512.9
   6            2        1.27       3.15       4.8184         15.2        29.217          92.0
   7            1        0.25       4.00       5.8184         23.3        40.854         163.4

 Total   n = 1096     1096.00                                +42.2                       3242.4


Smith also gives details of the computational procedure when we are estimating 
the parameters of a continuous distribution. This is considerably more laborious. 
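
The single step of (19.97) used in Example 19.11 can be reproduced as follows. This is a sketch only: the observed and expected frequencies are copied from the table of the example (the frequency for j = 6 is taken as 2, the value that makes the total 1,096), and the variable names are illustrative.

```python
# Observed frequencies n_j and expected frequencies n*m_j for j = 0, 1, ..., 7,
# as printed in the table of Example 19.11 (n_6 = 2 is inferred from the total).
observed = [364, 376, 218, 89, 33, 13, 2, 1]
expected = [336.25, 397.30, 234.72, 92.45, 27.31, 6.45, 1.27, 0.25]

n = sum(observed)
x_bar = sum(j * nj for j, nj in enumerate(observed)) / n   # ML estimate, trial value

# Fourth column of the table: n_j^2 / (n m_j).
weights = [nj ** 2 / e for nj, e in zip(observed, expected)]

# Totals of the sixth and eighth columns.
numerator = sum(w * (j - x_bar) for j, w in enumerate(weights))
denominator = sum(w * (j + (j - x_bar) ** 2) for j, w in enumerate(weights))

# One step of (19.97): theta = x_bar * (1 + numerator / denominator).
theta_1 = x_bar * (1 + numerator / denominator)
print(round(x_bar, 6), round(numerator, 1), round(denominator, 1), round(theta_1, 4))
```

Because the column totals are differences of fairly large numbers, the last digits depend on the rounding used; the improved value comes out at roughly 1.197, near the figure reported by K. Smith after further iteration.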


19.28 Small-sample properties, the second ground for choice between the ML 
and MCS methods, are more amenable to general inquiry. C. R. Rao (1961, 1962a) 
defines a concept of second-order efficiency, and shows that in the multinomial model 
of 18.19, the ML is the only BAN estimator with optimum second-order efficiency, 
under regularity conditions. Berkson (1955, 1956) has carried out sampling experi- 
ments which show that, in a situation arising in a bio-assay problem, the ML estimator 
presents difficulties, being sometimes infinite, while another BAN estimator has smaller 
mean-square error. These papers should be read in the light of another by Silverstone
(1957), which points out some errors in them.


EXERCISES 


19.1 In the linear model (19.8), suppose that a further parameter $\theta_0$ is introduced,
so that we have the new model
$$\mathbf{y} = \mathbf{X}\boldsymbol{\theta} + \mathbf{1}\theta_0 + \boldsymbol{\epsilon},$$
where $\mathbf{1}$ is an $(n \times 1)$ vector of units. Show that the LS estimator in the new model of $\boldsymbol{\theta}$,
the original vector of $k$ parameters, remains of exactly the same form (19.12) as in the
original model, with $y_i$ replaced by $(y_i - \bar y)$ and $x_{ij}$ by $(x_{ij} - \bar x_j)$ for $i = 1, 2, \ldots, n$, and
$j = 1, 2, \ldots, k$.


19.2 If, in the linear model (19.8), we replace the simple dispersion matrix (19.10)
by a non-singular dispersion matrix $\sigma^2 \mathbf{V}$ which allows correlations and unequal variances
among the $\epsilon_i$, show by putting $\mathbf{w} = \mathbf{T}'\mathbf{y}$, where $\mathbf{T}\mathbf{T}' = \mathbf{V}^{-1}$, that the LS and MV
unbiassed estimator of $\mathbf{C}\boldsymbol{\theta}$ is
$$\mathbf{C}\hat{\boldsymbol{\theta}} = \mathbf{C}(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}.$$
(cf. Aitken, 1935; Plackett, 1949)


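19.3 Generalizing (19.38), show that if $E(\boldsymbol{\epsilon}\boldsymbol{\epsilon}') = \sigma^2\mathbf{V}$, $E(\boldsymbol{\epsilon}'\mathbf{B}\boldsymbol{\epsilon}) = \sigma^2 \operatorname{tr}(\mathbf{B}\mathbf{V})$. Show
further that $\operatorname{var}(\boldsymbol{\epsilon}'\mathbf{B}\boldsymbol{\epsilon}) = 2\sigma^4 \operatorname{tr}(\mathbf{B}\mathbf{V}\mathbf{B}\mathbf{V})$ if $\boldsymbol{\epsilon}$ is multinormal.


19.4 Show that in 19.12 the ratio $(\mathbf{X}\hat{\boldsymbol{\theta}})'(\mathbf{X}\hat{\boldsymbol{\theta}})/(ks^2)$ is distributed in Fisher's $F$
distribution with $k$ and $(n-k)$ degrees of freedom if $\boldsymbol{\theta} = \mathbf{0}$.


19.5 In Exercise 19.2, show that, generalizing (19.26),
$$\mathbf{V}(\mathbf{C}\hat{\boldsymbol{\theta}}) = \sigma^2\, \mathbf{C}(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{C}',$$
and that the generalization of (19.40) is, using Exercise 19.3,
$$E\{\boldsymbol{\epsilon}'[\mathbf{V}^{-1} - \mathbf{V}^{-1}\mathbf{X}(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}]\boldsymbol{\epsilon}\} = (n-k)\sigma^2.$$


19.6 Prove the statement in 19.15 to the effect that $\hat{\boldsymbol{\theta}}$ and $\mathbf{V}(\hat{\boldsymbol{\theta}})$ in the singular case
are unaffected by replacing $\mathbf{B}$ by $\mathbf{U}\mathbf{B}$, where $\mathbf{U}$ is non-singular.


19.7 Using (19.51), (19.54) and (19.55), show that $(\mathbf{X}'\mathbf{X} + \mathbf{B}'\mathbf{B})^{-1}\mathbf{B}'\mathbf{B} = \mathbf{D}(\mathbf{B}\mathbf{D})^{-1}\mathbf{B}$
and hence that (19.58) may be written
$$\mathbf{V}(\hat{\boldsymbol{\theta}})(\mathbf{X}'\mathbf{X}) = \sigma^2\{\mathbf{I}_k - \mathbf{D}(\mathbf{B}\mathbf{D})^{-1}\mathbf{B}\}.$$
(Plackett, 1950)


19.8 Using the result of Exercise 19.7,
$$(\mathbf{X}'\mathbf{X} + \mathbf{B}'\mathbf{B})^{-1}\mathbf{B}'\mathbf{B} = \mathbf{D}(\mathbf{B}\mathbf{D})^{-1}\mathbf{B},$$
modify the argument of 19.9 to show that the unbiassed estimator of $\sigma^2$ in the singular
case is
$$\frac{1}{(n-r)}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\theta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\theta}}).$$
(Plackett, 1950)


19.9 For the linear model (19.8-10), show using 19.9 that the quadratic form $\boldsymbol{\theta}'\mathbf{A}\boldsymbol{\theta}$
is unbiassedly estimated by $\hat{\boldsymbol{\theta}}'\mathbf{A}\hat{\boldsymbol{\theta}} - s^2 \operatorname{tr}\{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{A}\}$.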


19.10 Show that in the case of a symmetrical parent distribution, the condition that
the ordered LS estimator $\hat\mu$ in (19.82) is equal to the sample mean $\bar y = \mathbf{1}'\mathbf{y}/\mathbf{1}'\mathbf{1}$ is that
$$\mathbf{V}\mathbf{1} = \mathbf{1},$$
i.e. that the sum of each row of the dispersion matrix be unity. Show that this property
holds for the univariate normal distribution. (Lloyd, 1952)


19.11 For the exponential distribution
$$dF(y) = \exp\left\{-\left(\frac{y-\mu}{\sigma}\right)\right\} dy/\sigma, \qquad \sigma > 0;\ \mu \le y < \infty,$$
show that in (19.62) the elements of $\boldsymbol{\alpha}$ are
$$\alpha_r = \sum_{i=1}^{r} (n-i+1)^{-1}$$
and that those of $\mathbf{V}$ are
$$V_{rs} = \sum_{i=1}^{m} (n-i+1)^{-2}, \qquad \text{where } m = \min(r, s).$$
Hence verify that the inverse matrix is
$$\mathbf{V}^{-1} = \begin{pmatrix}
n^2 + (n-1)^2 & -(n-1)^2 & & & 0 \\
-(n-1)^2 & (n-1)^2 + (n-2)^2 & -(n-2)^2 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -2^2 & 2^2 + 1^2 & -1^2 \\
0 & & & -1^2 & 1^2
\end{pmatrix}.$$


19.12 In Exercise 19.11, show from (19.69) that the MV unbiassed estimators are
$$\hat\mu = y_{(1)} - (\bar y - y_{(1)})/(n-1), \qquad \hat\sigma = n(\bar y - y_{(1)})/(n-1),$$
and compare these with the ML estimators of the same parameters.
(Sarhan, 1954)


19.13 Show that when all the $n_i$ are large the minimization of the chi-squared
expression (19.91) or (19.93) gives the same estimator as the ML method.


19.14 In the case s = 1, show that the first two moments of the statistic (19.91) are 
given by 


b 
My j=1%15 My, 


ni E(x") = k, — 1, ; 
1 2 ae 
ni var (x7) = 2 (Rk, —1) oca py : 


ny 


and that for any c > 0, the generalization of (19.93) has expectation 


ky —.)2 ky 4 1 

B ps (scnt el oo ee ge [O-e+2) ee -G-9h+1| +0(;3) 
j=1 Myre ny j=l 1 = 

Thus, to the second order at least, the $\pi_{1j}$ disappear from the expectation if $b = c-2$.
If $b = 0$, $c = 2$, it is $(k_1 - 1)\left(1 - \dfrac{1}{n_1}\right)$, and if $b = 1$, $c = 3$, it is $(k_1 - 1) + \dfrac{1}{n_1}$, which for
$k_1 > 2$ is even closer to the expectation of (19.91).
(F. N. David, 1950; Haldane, 1955a) 


19.15 For a binomial distribution with probability of success equal to $\pi$, show that the
MCS estimator of $\pi$ obtained from (19.91) is identical with the ML estimator for any $n$;
and that if the number of successes is not 0 or $n$, the modified MCS estimator obtained
from (19.93) is also identical with the ML estimator.


19.16 $y_i$ is a Poisson variable with parameter $\theta x_i$, where $x_i$ is a constant observed
with $y_i$ $(i = 1, 2, \ldots, n)$. Show that the ML estimator of $\theta$ is $\sum y_i/\sum x_i$, with asymptotic
variance $\theta/\sum x_i$, and that the LS estimator $\sum y_i x_i/\sum x_i^2$ has exact variance $\theta \sum x_i^3/(\sum x_i^2)^2$.
Hence show that the LS estimator is inefficient unless the $x_i$ are all equal. Explain the
result.


CHAPTER 20 
INTERVAL ESTIMATION : CONFIDENCE INTERVALS 


20.1 In the previous three chapters we have been concerned with methods which 
will provide an estimate of the value of one or more unknown parameters ; and the 
methods gave functions of the sample values—the estimators—which, for any given 
sample, provided a unique estimate. It was, of course, fully recognized that the 
estimate might differ from the parameter in any particular case, and hence that there 
was a margin of uncertainty. The extent of this uncertainty was expressed in terms 
of the sampling variance of the estimator. With the somewhat intuitive approach 
which has served our purpose up to this point, we might say that it is probable that
$\theta$ lies in the range $t \pm \sqrt{(\operatorname{var} t)}$, very probable that it lies in the range $t \pm 2\sqrt{(\operatorname{var} t)}$,
and so on. In short, what we might do is, in effect, to locate $\theta$ in a range and not at
a particular point, although regarding one point in the range, viz. $t$ itself, as having
a claim to be considered as the "best" estimate of $\theta$.


20.2 In the present chapter we shall examine this procedure more closely and
look at the problem of estimation from a different point of view. We now abandon
attempts to estimate $\theta$ by a function which, for a specified sample, gives a unique
estimate. Instead, we shall consider the specification of a range in which $\theta$ lies. Three
methods, of which two are similar but not identical, arise for discussion. The first,
known as the method of Confidence Intervals, relies only on the frequency theory of
probability without importing any new principle of inference. The second, which we
shall call the method of Fiducial Intervals, explicitly requires something beyond a
frequency theory. The third relies on Bayes' theorem and some form of Bayes' postulate
(8.4). In the present chapter we shall attempt to explain the basic ideas and
methods of Confidence Interval estimation, which are due to Neyman; the memoir
of 1937 should be particularly mentioned (see Neyman (1937b)). In Chapter 21 we
shall be concerned with the same aspects of Fiducial Intervals and Bayes' estimation.


Confidence statements 


20.3 Consider first a distribution dependent on a single unknown parameter $\theta$
and suppose that we are given a random sample of $n$ values $x_1, \ldots, x_n$ from the population.
Let $z$ be a variable dependent on the $x$'s and on $\theta$, whose sampling distribution
is independent of $\theta$. (The examples given below will show that in some cases at least
such a function may be found.) Then, given any probability $1-\alpha$, we can find a value
$z_1$ such that
$$\int_{z_1}^{\infty} dF(z) = 1-\alpha,$$
and this is true whatever the value of $\theta$. In the notation of the theory of probability
we shall then have
$$P(z \ge z_1) = 1-\alpha. \tag{20.1}$$
Now it may happen that the inequality $z \ge z_1$ can be written in the form $\theta \le t_1$ or
$\theta \ge t_1$, where $t_1$ is some function depending on the value $z_1$ and the $x$'s but not on $\theta$.
For instance, if $z = \bar x - \theta$ we shall have
$$\bar x - \theta \ge z_1$$
and hence
$$\theta \le \bar x - z_1.$$
If we can rewrite the inequality in this way, we have, from (20.1),
$$P(\theta \le t_1) = 1-\alpha. \tag{20.2}$$
More generally, whether or not the distribution of $z$ is independent of $\theta$, suppose
that we can find a statistic $t_1$, depending on $1-\alpha$ and the $x$'s but not on $\theta$, such that
(20.2) is true for all $\theta$. Then we may use this equation in probability to make certain
statements about $\theta$.


20.4 Note, in the first place, that we cannot assert that the probability is $1-\alpha$ that
$\theta$ does not exceed a constant $t_1$. This statement (in the frequency theory of probability)
can only relate to the variation of $\theta$ in a population of $\theta$'s, and in general we
do not know that $\theta$ varies at all. If it is merely an unknown constant, then the probability
that $\theta \le t_1$ is either unity or zero. We do not know which of these values is
correct, but we do know that one of them is correct.

We therefore look at the matter in another way. Although $\theta$ is not a random variable,
$t_1$ is, and will vary from sample to sample. Consequently, if we assert that $\theta \le t_1$
in each case presented for decision, we shall be right in a proportion $1-\alpha$ of the cases
in the long run. A statement of the probability that $\theta$ is less than or equal to some
assigned value has no meaning except in the trivial sense already mentioned; but the
statement that a statistic $t_1$ is greater than or equal to $\theta$ (whatever $\theta$ happens to be)
has a definite probability of being correct. If therefore we make it a rule to assert
the inequality $\theta \le t_1$ for any sample values which arise, we have the assurance of
being right in a proportion $1-\alpha$ of the cases "on the average" or "in the long run."

This idea is basic to the theory of confidence intervals which we proceed to develop,
and the reader should satisfy himself that he has grasped it. In particular, we stress
that the confidence statement holds whatever the value of $\theta$: we are not concerned
with repeated sampling from the same population, but just with repeated sampling.
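
The long-run interpretation can be illustrated by simulation. The sketch below assumes, purely for illustration, a normal population with unit variance and takes z = x-bar minus theta as in 20.3, so that t_1 = x-bar minus z_1; the parameter is given a fresh, arbitrary value on every occasion, to reflect the point that the populations sampled need not be the same. The sample size, seed and number of trials are arbitrary choices.

```python
import random
from statistics import NormalDist, fmean

def upper_limit(sample, confidence):
    """t_1 = x_bar - z_1, where P(x_bar - theta >= z_1) = confidence for a
    normal population with unit variance (an assumption of this sketch)."""
    n = len(sample)
    # x_bar - theta is normal with mean 0 and standard deviation 1/sqrt(n).
    z_1 = NormalDist(0.0, 1.0 / n ** 0.5).inv_cdf(1.0 - confidence)
    return fmean(sample) - z_1

rng = random.Random(2024)          # fixed seed, so the run is reproducible
confidence, n, trials = 0.90, 5, 20000
correct = 0
for _ in range(trials):
    theta = rng.uniform(-100.0, 100.0)     # a different parameter each time
    sample = [rng.gauss(theta, 1.0) for _ in range(n)]
    correct += theta <= upper_limit(sample, confidence)

proportion = correct / trials
print(proportion)
```

The proportion of correct assertions should settle near the chosen value of 0.90, although no probability statement has been made about any individual value of the parameter.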


20.5 To simplify the exposition we have considered only a single quantity $t_1$ and
the statement that $\theta \le t_1$. In practice, however, we usually seek two quantities $t_0$
and $t_1$, such that for all $\theta$
$$P(t_0 \le \theta \le t_1) = 1-\alpha,^{(*)} \tag{20.3}$$
and make the assertion that $\theta$ lies in the interval $t_0$ to $t_1$, which is called a Confidence
Interval for $\theta$. $t_0$ and $t_1$ are known as the Lower and Upper Confidence Limits respectively.
They depend only on $1-\alpha$ and the sample values. For any fixed $1-\alpha$, the
totality of Confidence Intervals for different samples determines a field within which $\theta$
is asserted to lie. This field is called the Confidence Belt. We shall give a graphical
representation of the idea below. The fraction $1-\alpha$ is called the Confidence Coefficient.

(*) We shall almost always write $1-\alpha$ for the probability of the interval covering the parameter,
but practice in the literature varies, and $\alpha$ is often written instead. Our convention is nowadays
the more common.


Example 20.1
Suppose we have a sample of size $n$ from the normal population with known variance
(taken without loss of generality to be unity)
$$dF = \frac{1}{\sqrt{(2\pi)}} \exp\{-\tfrac12 (x-\mu)^2\}\,dx, \qquad -\infty < x < \infty.$$
The distribution of the sample mean $\bar x$ is
$$dF = \sqrt{\left(\frac{n}{2\pi}\right)} \exp\left\{-\frac{n}{2}(\bar x-\mu)^2\right\} d\bar x, \qquad -\infty < \bar x < \infty.$$
From the tables of the normal integral we know that the probability of a positive
deviation from the mean not greater than twice the standard deviation is 0.97725. We
have then
$$P(\bar x - \mu \le 2/\sqrt{n}) = 0.97725,$$
which is equivalent to
$$P(\bar x - 2/\sqrt{n} \le \mu) = 0.97725.$$
Thus, if we assert that $\mu$ is greater than or equal to $\bar x - 2/\sqrt{n}$ we shall be right in about
97.725 per cent of cases in the long run.
Similarly we have
$$P(\bar x - \mu \ge -2/\sqrt{n}) = P(\mu \le \bar x + 2/\sqrt{n}) = 0.97725.$$
Thus, combining the two results,
$$P(\bar x - 2/\sqrt{n} \le \mu \le \bar x + 2/\sqrt{n}) = 1 - 2(1 - 0.97725) = 0.9545. \tag{20.4}$$
Hence, if we assert that $\mu$ lies in the range $\bar x \pm 2/\sqrt{n}$, we shall be right in about 95.45
per cent of cases in the long run.

Conversely, given the confidence coefficient, we can easily find from the tables of
the normal integral the deviation $d$ such that
$$P(\bar x - d/\sqrt{n} \le \mu \le \bar x + d/\sqrt{n}) = 1-\alpha.$$
For instance, if $1-\alpha = 0.8$, $d = 1.28$, so that if we assert that $\mu$ lies in the range
$\bar x \pm 1.28/\sqrt{n}$ the odds are 4 to 1 that we shall be right.


The reader to whom this approach is new will probably ask : but is this not a round- 
about method of using the standard error to set limits to an estimate of the mean ? 
In a way, it is. Effectively, what we have done in this example is to show how the use
of the standard error of the mean in normal samples may be justified on logical grounds 
without appeal to new principles of inference other than those incorporated in the 
theory of probability itself. In particular we make no use of Bayes’ postulate (8.4). 

Another point of interest in this example is that the upper and lower confidence
limits derived above are equidistant from the mean $\bar x$. This is not by any means
necessary, and it is easy to see that we can derive any number of alternative limits for
the same confidence coefficient $1-\alpha$. Suppose, for instance, we take $1-\alpha = 0.9545$,
and select two numbers $\alpha_0$ and $\alpha_1$ which obey the condition
$$\alpha_0 + \alpha_1 = \alpha = 0.0455,$$
say $\alpha_0 = 0.01$ and $\alpha_1 = 0.0355$. From the tables of the normal integral we have
$$P(\bar x - \mu \le 2.326/\sqrt{n}) = 0.99,$$
$$P(\bar x - \mu \ge -1.806/\sqrt{n}) = 0.9645,$$
and hence
$$P\left(\bar x - \frac{2.326}{\sqrt{n}} \le \mu \le \bar x + \frac{1.806}{\sqrt{n}}\right) = 0.9545. \tag{20.5}$$
Thus, with the same confidence coefficient we can assert that $\mu$ lies in the interval
$\bar x - 2/\sqrt{n}$ to $\bar x + 2/\sqrt{n}$, or in the interval $\bar x - 2.326/\sqrt{n}$ to $\bar x + 1.806/\sqrt{n}$. In either
case we shall be right in about 95.45 per cent of cases in the long run.

We note that in the first case the interval has length $4/\sqrt{n}$, while in the second case
its length is $4.132/\sqrt{n}$. Other things being equal, we should choose the first set of
limits since they locate the parameter in a narrower range. We shall consider this
point in more detail below. It does not always happen that there is an infinity of
possible confidence limits or that the choice between them can be made on such clear-cut
grounds as in this example.
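
The normal deviates quoted in Example 20.1 can be recovered from the inverse of the normal integral. The following is a small sketch using the standard library's `statistics.NormalDist`; the function name and its arguments are illustrative only, and the multipliers are to be divided by the square root of n to give the limits.

```python
from statistics import NormalDist

std_normal = NormalDist()   # standard normal distribution

def interval_multipliers(alpha_0, alpha_1):
    """Return (d0, d1) such that, for a normal sample with unit variance,
    P(x_bar - d0/sqrt(n) <= mu <= x_bar + d1/sqrt(n)) = 1 - alpha_0 - alpha_1."""
    d0 = std_normal.inv_cdf(1 - alpha_0)   # fixes the lower limit
    d1 = std_normal.inv_cdf(1 - alpha_1)   # fixes the upper limit
    return d0, d1

alpha = 1 - 0.9545
central = interval_multipliers(alpha / 2, alpha / 2)
non_central = interval_multipliers(0.01, alpha - 0.01)
d_80 = interval_multipliers(0.1, 0.1)[0]

# Multipliers and total lengths (in units of 1/sqrt(n)) of the two intervals.
print([round(d, 3) for d in central], round(sum(central), 3))
print([round(d, 3) for d in non_central], round(sum(non_central), 3))
print(round(d_80, 2))
```

Trying other divisions of the total 0.0455 between the two tails gives lengths no shorter than the central one, in keeping with the remark that the symmetrical limits locate the parameter in the narrower range.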


Graphical representation 

20.6 In a number of simple cases, including that of Example 20.1, the confidence
limits can be represented in a useful graphical form. We take two orthogonal axes,
OX relating to the observed $\bar x$ and OY to $\mu$ (see Fig. 20.1).

Fig. 20.1. Confidence limits in Example 20.1 for n = 1

The two straight lines shown have as their equations
$$\mu = \bar x + 2, \qquad \mu = \bar x - 2.$$
Consequently, for any point between the lines,
$$\bar x - 2 \le \mu \le \bar x + 2.$$
Hence, if for any observed $\bar x$ we read off the two ordinates on the lines corresponding
to that value, we obtain the two confidence limits. The vertical interval between the
limits is the confidence interval (shown in the diagram for $\bar x = 1$), and the total zone
between the lines is the confidence belt. We may refer to the two lines as upper and
lower confidence lines respectively.

This example relates to the case $n = 1$ in Example 20.1. For different values of $n$,
there will be different confidence lines, all parallel to $\mu = \bar x$, and getting closer to each
other as $n$ increases. They may be shown on a single diagram for selected values
of $n$, and a figure so constructed provides a useful method of reading off confidence
limits in practical work.

Alternatively, we may wish to vary the confidence coefficient $1-\alpha$, which in
our example is 0.9545. Again, we may show a series of pairs of confidence lines, each
pair corresponding to a selected value of $1-\alpha$, on a single diagram relating to some
fixed value of $n$. In this case, of course, the lines become farther apart with increasing
$1-\alpha$. In fact, in many practical situations, we are interested in the variation of the
confidence interval with $1-\alpha$, and we may validly make assertions of the form (20.3)
simultaneously for a number of values of $\alpha$: each will be true in the corresponding
proportion of cases in the long run. Indeed, this procedure may be taken to its extreme
form, when we consider all values of $1-\alpha$ in (0, 1) simultaneously, and thus generate
a "confidence distribution" of the parameter (the term is due to D. R. Cox (e.g. 1958b)):
we then have an infinite sequence of simultaneous confidence statements, each contained
within the preceding one, with increasing values of $1-\alpha$.


Central and non-central intervals 


20.7 In Example 20.1 the sampling distribution on which the confidence intervals
were based was symmetrical, and hence, by taking equal deviations from the mean,
we obtained equal values of
$$1-\alpha_0 = P(t_0 \le \theta)$$
and
$$1-\alpha_1 = P(\theta \le t_1).$$
In general, we cannot achieve this result with equal deviations, but subject always to
the condition $\alpha_0 + \alpha_1 = \alpha$, $\alpha_0$ and $\alpha_1$ may be chosen arbitrarily.

If $\alpha_0$ and $\alpha_1$ are taken to be equal, we shall say that the intervals are central. In
such a case we have
$$P(t_0 > \theta) = P(\theta > t_1) = \alpha/2. \tag{20.6}$$
In the contrary case the intervals will be called non-central. It should be observed
that centrality in this sense does not mean that the confidence limits are equidistant
from the sample statistic, unless the sampling distribution is symmetrical.


20.8 In the absence of other considerations it is usually convenient to employ
central intervals, but circumstances sometimes arise in which non-central intervals are
more serviceable. Suppose, for instance, we are estimating the proportion of some
drug in a medicinal preparation and the drug is toxic in large doses. We must then
clearly err on the safe side, an excess of the true value over our estimate being more
serious than a deficiency. In such a case we might like to take $\alpha_1$ equal to zero, so that
$$P(\theta \le t_1) = 1,$$
$$P(t_0 \le \theta) = 1-\alpha,$$
in order to be certain that $\theta$ is not greater than $t_1$. But if our statistic has a sampling
distribution with infinite range, this is only possible with $t_1$ infinite, so we must content
ourselves with making $\alpha_1$ very close to zero.

Again, if we are estimating the proportion of viable seed in a sample of material
that is to be placed on the market, we are more concerned with the accuracy of the
lower limit than that of the upper limit, for a deficiency of germination is more serious
than an excess from the grower's point of view. In such circumstances we should
probably take $\alpha_0$ as small as conveniently possible so as to be near to certainty about
the minimum value of viability. This kind of situation often arises in the specification
of the quality of a manufactured product, the seller wishing to guarantee a minimum
standard but being much less concerned with whether his product exceeds expectation.


20.9 On a somewhat similar point, it may be remarked that in certain circumstances
it is enough to know that $P(t_0 \le \theta \le t_1) \ge 1-\alpha$. We then know that in asserting $\theta$
to lie in the range $t_0$ to $t_1$ we shall be right in at least a proportion $1-\alpha$ of the cases.
Mathematical difficulties in ascertaining confidence limits exactly for given $1-\alpha$, or
theoretical difficulties when the distribution is discontinuous may, for example, lead
us to be content with this inequality rather than the equality of (20.3).


Example 20.2 


To find confidence intervals for the probability $\varpi$ of "success" in sampling for
attributes.

In samples of size $n$ the distribution of successes is arrayed by the binomial $(\chi+\varpi)^n$,
where $\chi = 1-\varpi$. We will determine the limits for the case $n = 20$ and confidence
coefficient 0.95.

We require in the first instance the distribution function of the binomial. The
table below shows the functions for certain specified values up to $\varpi = 0.5$ (the
remainder being obtainable by symmetry). For the accurate construction of the confidence
belt we require more detailed information such as is obtainable from the
comprehensive tables of the binomial function referred to in 5.7. These, however,
will serve for purposes of illustration.

Proportion of
successes $p$   $\varpi = 0.1$   $\varpi = 0.2$   $\varpi = 0.3$   $\varpi = 0.4$   $\varpi = 0.5$

    0.00          0.1216         0.0115         0.0008           -              -
    0.05          0.3918         0.0691         0.0076         0.0005           -
    0.10          0.6770         0.2060         0.0354         0.0036         0.0002
    0.15          0.8671         0.4114         0.1070         0.0159         0.0013
    0.20          0.9569         0.6296         0.2374         0.0509         0.0059
    0.25          0.9888         0.8042         0.4163         0.1255         0.0207
    0.30          0.9977         0.9133         0.6079         0.2499         0.0577
    0.35          0.9997         0.9678         0.7722         0.4158         0.1316
    0.40          1.0001         0.9900         0.8866         0.5955         0.2517
    0.45          1.0002         0.9974         0.9520         0.7552         0.4119
    0.50            -            0.9994         0.9828         0.8723         0.5881
    0.55            -            0.9999         0.9948         0.9433         0.7483
    0.60            -            1.0000         0.9987         0.9788         0.8684
    0.65            -              -            0.9997         0.9934         0.9423
    0.70            -              -            0.9999         0.9983         0.9793
    0.75            -              -              -            0.9996         0.9941
    0.80            -              -              -            0.9999         0.9987
    0.85            -              -              -              -            0.9998
    0.90            -              -              -              -            1.0000
    0.95            -              -              -              -              -

The final figures may be a unit or two in error owing to rounding up, but that need
not bother us to the degree of approximation here considered.

We note in the first place that the variate $p$ is discontinuous. On the other hand,
we are prepared to consider any value of $\varpi$ in the range 0 to 1. For given $\varpi$ we cannot
in general find limits to $p$ for which $1-\alpha$ is exactly 0.95; but we will take $p$ to be
the sample proportion which gives a confidence coefficient at least equal to 0.95, so
as to be on the safe side. We will consider only central intervals, so that for given $\varpi$
we have to find $\varpi_0$ and $\varpi_1$ such that
$$P(p \ge \varpi_0) \ge 0.975,$$
$$P(p \le \varpi_1) \ge 0.975,$$
the inequalities for $P$ being as near to equality as we can make them.

Consider the diagrammatic representation of the type shown in Fig. 20.2.

From the table we can find, for any assigned $\varpi$, the values $\varpi_0$ and $\varpi_1$ such that
$P(p \ge \varpi_0) \ge 0.975$ and $P(p \le \varpi_1) \ge 0.975$. Note that in determining $\varpi_0$ the distribution
function gives the probability of obtaining a proportion $p$ or less of successes,
so that the complement of the function gives the probability of a proportion strictly
greater than $p$. Here, for example, on the horizontal through $\varpi = 0.1$ we find $\varpi_0 = 0$
and $\varpi_1 = 0.25$ from our table; and for $\varpi = 0.4$ we have $\varpi_0 = 0.20$ and $\varpi_1 = 0.60$. The
points so obtained lie on stepped curves which have been drawn in. For example, when
$\varpi = 0.3$ the greatest value of $\varpi_0$ such that $P(p \ge \varpi_0) \ge 0.975$ is 0.1. By the time $\varpi$ has
increased to 0.4 the value of $\varpi_0$ has increased to 0.20. Somewhere between is the marginal
value of $\varpi$ such that $P(p \ge 0.1)$ is exactly 0.975. If we tabulated the probabilities for
finer intervals of $\varpi$ these step curves would be altered slightly; and in the limit, if we
calculate values of $\varpi$ such that $P(p \ge \varpi_0) = 0.975$ exactly we obtain points lying inside
our present step curves. These points have been joined by dotted lines in Fig. 20.2.

The zone between the stepped lines is the confidence belt. For any $p$ the probability that we shall be wrong in locating $\varpi$ inside the belt is at the most 0.05. We determine $p_0$ and $p_1$ by drawing a vertical at the observed value of $p$ on the abscissa and reading off the values where it intersects the appropriate lines giving $\varpi_0$ and $\varpi_1$. That these are, in fact, the required limits will be shown in a moment.

Fig. 20.2. Confidence limits for a binomial parameter (abscissa: values of $p$, from 0 to 1.0)

We consider a more sophisticated method of dealing with discontinuities below 
(20.22). 


It is, perhaps, worth noticing that the points on the curves of Fig. 20.2 were constructed by selecting an ordinate $\varpi$ and then finding the corresponding abscissae $\varpi_0$ and $\varpi_1$. The diagram is, so to speak, constructed horizontally. In applying it, however, we read it vertically, that is to say, with observed abscissa $p$ we read off two values $p_0$ and $p_1$ and assert that $p_0 \leq \varpi \leq p_1$. It is instructive to observe how this change of viewpoint can be justified without reference to Bayes' postulate.

Considering the diagram horizontally we see that, for any given $\varpi$, an observation falls in the confidence belt with probability $\geq 1-\alpha$. If and only if the observation is in the belt, the pair of values $(p_0, p_1)$ will contain between them the true value of $\varpi$. Thus the latter event has probability $\geq 1-\alpha$, whatever the true value of $\varpi$.
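The horizontal construction and the vertical reading can be imitated numerically. The following Python sketch is an illustration of ours, not part of the original tabulation. It assumes samples of $n = 20$ (the tabulated value 0.1216 at $p = 0$, $\varpi = 0.1$ agrees with $0.9^{20}$), and the function names and the grid of 1000 values of $\varpi$ are hypothetical choices.

```python
from math import comb


def binom_cdf(c, n, w):
    """P(number of successes <= c) in n trials with success probability w."""
    return sum(comb(n, j) * w**j * (1 - w) ** (n - j) for j in range(c + 1))


def acceptance_limits(n, w, tail=0.975):
    """Horizontal step: for a given w, return (c0, c1) as numbers of successes.

    c0 is the greatest c with P(successes >= c) >= tail,
    c1 is the least c with P(successes <= c) >= tail.
    """
    c0 = max(c for c in range(n + 1) if 1 - binom_cdf(c - 1, n, w) >= tail)
    c1 = min(c for c in range(n + 1) if binom_cdf(c, n, w) >= tail)
    return c0, c1


def confidence_interval(c, n, tail=0.975, grid=1000):
    """Vertical step: range of grid values of w whose limits contain c."""
    inside = []
    for k in range(grid + 1):
        w = k / grid
        c0, c1 = acceptance_limits(n, w, tail)
        if c0 <= c <= c1:
            inside.append(w)
    return min(inside), max(inside)


n = 20  # assumed sample size
for w in (0.1, 0.3):
    c0, c1 = acceptance_limits(n, w)
    print(w, c0 / n, c1 / n)       # limits expressed as proportions
print(confidence_interval(5, n))   # limits for w when 5 successes are observed
```

The grid search is deliberately crude; it only mirrors the geometry of the belt, and a finer grid brings the limits closer to the dotted curves described above.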


Confidence intervals for large samples 
20.10 We have seen (18.16) that the first derivative of the logarithm of the Likelihood Function is, under regularity conditions, asymptotically normally distributed with zero mean and

$$\operatorname{var}\left(\frac{\partial \log L}{\partial\theta}\right) = E\left\{\left(\frac{\partial \log L}{\partial\theta}\right)^{2}\right\} = -E\left(\frac{\partial^{2} \log L}{\partial\theta^{2}}\right). \tag{20.7}$$

We may use this fact to set confidence intervals for $\theta$ in large samples. Writing

$$\psi = \frac{\partial \log L}{\partial\theta} \bigg/ \left[E\left\{\left(\frac{\partial \log L}{\partial\theta}\right)^{2}\right\}\right]^{1/2}, \tag{20.8}$$

so that $\psi$ is a standardized normal variate in large samples, we may from the normal integral determine confidence limits for $\theta$ in large samples if $\psi$ is a monotonic function of $\theta$, so that inequalities in one may be transformed to inequalities in the other. The following examples illustrate the procedure.


Example 20.3 
Consider again the problem of Example 20.1. We have seen in Example 17.6 that in this case

$$\frac{\partial \log L}{\partial\mu} = n(\bar{x}-\mu), \tag{20.9}$$

so that

$$\frac{\partial^{2} \log L}{\partial\mu^{2}} = -n \tag{20.10}$$

and, from (20.7) and (20.8),

$$\psi = (\bar{x}-\mu)\sqrt{n} \tag{20.11}$$

is normally distributed with unit variance for large $n$. (We know, of course, that this is true for any $n$ in this particular case.) Confidence limits for $\mu$ may then be set as in Example 20.1.


Example 20.4 
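Consider the Poisson distribution whose general term is

$$f(x, \lambda) = \frac{e^{-\lambda}\lambda^{x}}{x!}, \qquad x = 0, 1, \ldots$$

We have seen in Example 17.8 that

$$\frac{\partial \log L}{\partial\lambda} = \frac{n}{\lambda}(\bar{x}-\lambda). \tag{20.12}$$

Hence

$$\frac{\partial^{2} \log L}{\partial\lambda^{2}} = -\frac{n\bar{x}}{\lambda^{2}}$$

and

$$E\left(-\frac{\partial^{2} \log L}{\partial\lambda^{2}}\right) = \frac{n}{\lambda}. \tag{20.13}$$

Hence, from (20.7) and (20.8),

$$\psi = (\bar{x}-\lambda)\sqrt{(n/\lambda)}. \tag{20.14}$$

For example, with $1-\alpha = 0.95$, corresponding to a normal deviate $\pm 1.96$, we have, for the central confidence limits,

$$(\bar{x}-\lambda)\sqrt{(n/\lambda)} = \pm 1.96,$$

giving, on solution for $\lambda$,

$$\lambda^{2} - \left(2\bar{x} + \frac{3.84}{n}\right)\lambda + \bar{x}^{2} = 0$$

or

$$\lambda = \bar{x} + \frac{1.92}{n} \pm \sqrt{\left(\frac{3.84\,\bar{x}}{n} + \frac{3.69}{n^{2}}\right)},$$

the ambiguity in the square root giving upper and lower limits respectively. To order $n^{-1/2}$ this is equivalent to

$$\lambda = \bar{x} \pm 1.96\sqrt{(\bar{x}/n)}, \tag{20.15}$$

from which the upper and lower limits are seen to be equidistant from the mean $\bar{x}$, as we should expect.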
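As a check on the arithmetic, the exact roots of the quadratic in $\lambda$ may be compared with the approximation (20.15). The sketch below is ours; the sample mean 4.0, the sample size 25 and the function names are hypothetical values chosen only for illustration.

```python
from math import sqrt


def poisson_limits(xbar, n, deviate=1.96):
    """Roots in lambda of (xbar - lambda)^2 * n / lambda = deviate^2."""
    half = deviate**2 / (2 * n)
    root = sqrt((xbar + half) ** 2 - xbar**2)
    return xbar + half - root, xbar + half + root


def poisson_limits_approx(xbar, n, deviate=1.96):
    """Limits to order n^(-1/2), i.e. xbar -/+ deviate * sqrt(xbar / n)."""
    half_width = deviate * sqrt(xbar / n)
    return xbar - half_width, xbar + half_width


xbar, n = 4.0, 25  # hypothetical sample mean and sample size
print(poisson_limits(xbar, n))
print(poisson_limits_approx(xbar, n))
```

The exact limits are not symmetrical about $\bar{x}$; the asymmetry is of order $n^{-1}$ and disappears in the approximate form.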


20.11 The procedure we have used in arriving at (20.15) requires some further examination. If we have probabilities like

$$P(\theta \leq t) \quad \text{or} \quad P(\theta \geq t)$$

they can be immediately "inverted" so as to give

$$P(t \geq \theta) \quad \text{or} \quad P(t \leq \theta).$$

But we may encounter more complicated forms such as

$$P\{g(t, \theta) \leq 0\}$$

where $g$ is, say, a polynomial in $t$ or $\theta$ or both, of degree greater than unity. The problem of translating such an expression into terms of intervals for $\theta$ may be far from straightforward.
Let us reconsider (20.14) in the form

$$\psi = n^{1/2}(\bar{x}-\lambda)/\lambda^{1/2}. \tag{20.16}$$

Take a confidence coefficient $1-\alpha$ and let the corresponding values of $\psi$ be $\psi_0$ and $\psi_1$, i.e.

$$P\{\psi_0 \leq \psi \leq \psi_1\} = 1-\alpha. \tag{20.17}$$

Equation (20.16) may be written

$$\lambda^{2} - (2\bar{x} + n^{-1}\psi^{2})\lambda + \bar{x}^{2} = 0 \tag{20.18}$$

and if the intervals of $\psi$ are central, that is to say, if $\psi_0 = -\psi_1$, the roots in $\lambda$ of (20.18) are the same whether we put $\psi = \psi_0$ or $\psi = \psi_1$. Moreover, the roots are always real. Let $\lambda_0$, $\lambda_1$ be the roots of the equation with $\psi = \psi_0$ (or $\psi_1$), and let $\lambda_1$ be the larger. Then, as $\psi$ goes from $-\infty$ to $\psi_0$, $\lambda$ is seen from (20.18) to go from $+\infty$ to $\lambda_1$; as $\psi$ goes from $\psi_0$ to $\psi_1$, $\lambda$ goes (downwards) from $\lambda_1$ to $\lambda_0$; and as $\psi$ goes from $\psi_1$ to $+\infty$, $\lambda$ goes from $\lambda_0$ to 0. Thus

$$P(\psi_0 \leq \psi \leq \psi_1) = 1-\alpha$$

is equivalent to

$$P(\lambda_0 \leq \lambda \leq \lambda_1) = 1-\alpha,$$

and our confidence intervals are of the kind required.

It is instructive to consider this diagrammatically, as in Fig. 20.3.
From (20.18) we see that, for given $n$ and $\psi$, the curves relating $\lambda$ (ordinate) and $\bar{x}$ (abscissa) may be represented as

$$(\lambda - \bar{x})^{2} = k\lambda, \tag{20.19}$$

where $k$ is a positive constant. For varying $k$, these are parabolas with $\lambda = \bar{x}$ as the major axis, passing through the origin. The line $\lambda = \bar{x}$ corresponds to $k = 0$ or $n = \infty$ and we have shown two other curves (not to scale) for values $n_1$ and $n_2$ ($n_1 < n_2$). From our previous discussion it follows that, for given $n$, the values of $\lambda$ corresponding to values of $\psi$ inside the range $\psi_0$ to $\psi_1$ lie inside the appropriate parabolas. It is also evident that the parabola for $n_1$ lies wholly inside the parabola for any smaller $n$.

Fig. 20.3. Confidence parabolas (20.19) for varying $k$ or $n$ (ordinate: values of $\lambda$; abscissa: values of $\bar{x}$; curves shown for $n = \infty$, $n_2$ and $n_1$)

Thus, given any $\bar{x}$, we may read off ordinate-wise two corresponding values of $\lambda$ and assert that the unknown $\lambda$ lies between them. The confidence lines in Fig. 20.3 have similar properties of convexity and nestedness to those for the binomial distribution in Example 20.2.


20.12 Let us now consider a more complicated case. Suppose we have a statistic $t$ from which we can set limits $t_0$ and $t_1$, independent of $\theta$, with assigned confidence coefficient $1-\alpha$. And suppose that

$$t = a\theta^{3} + b\theta^{2} + c\theta + d, \tag{20.20}$$

where $a$, $b$, $c$, $d$ are constants. Sometimes, but not always, there will be three real values of $\theta$ corresponding to a value of $t$. How do we use them to set limits to $\theta$?

Again the position is probably clearest from a diagram. In Fig. 20.4 we graph $\theta$ as ordinate against $t$ as abscissa, again not to scale.

We have supposed the constants to be such that the cubic has a real maximum and minimum, as shown. For various values of $t$, the cubic of equation (20.20) is translated along the $t$-axis. To avoid confusing the diagram we will suppose that only the lines for one value of $n$ are shown. We also take $a > 0$.

Now for a given value of $t$, say $t_0$, there will be a cubic, as shown in the diagram, such that for the area on the right $a\theta^{3} + b\theta^{2} + c\theta + d > t_0$ and for the area on the left that cubic is $< t_0$. Similarly for $t_1$. With the appropriate confidence coefficient we may then say that for an observed $t$, the limits to $\theta$ are given by reading vertically along the ordinate at $t$.
Fig. 20.4. Confidence cubics (20.20) (see text) (ordinate: values of $\theta$; abscissa: values of $t$; vertical lines AB and CD marked)


We now begin to encounter difficulties. If we take a value such as that along AB in the diagram, we shall have to assert that $\theta$ lies in the broken range $\theta_1 \leq \theta \leq \theta_2$ and $\theta_3 \leq \theta \leq \theta_4$. On the other hand, at CD we have the unbroken range $\theta_5 \leq \theta \leq \theta_6$.

Devotees of pathological mathematics will have no difficulty in constructing further examples in which the intervals are broken up even more, or in which we have to assert that the parameter $\theta$ lies outside a closed interval. (See Fieller (1954) and S. T. David (1954) for some cases of consequence.) Cf. also Exercise 28.21 below.
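A minimal sketch of how a broken range can arise, assuming that the probability statement takes the form $t_0 \leq g(\theta) \leq t_1$ with $g$ a cubic. The cubic $\theta^{3} - 3\theta$, the limits $\pm 1$, the search range and the helper name are hypothetical choices for illustration only; they are not taken from Fig. 20.4.

```python
def theta_ranges(g, t0, t1, lo, hi, steps=60000):
    """Scan a grid of theta values and return the maximal runs on which
    t0 <= g(theta) <= t1, each run as a (start, end) pair."""
    ranges, start, prev = [], None, None
    for i in range(steps + 1):
        theta = lo + (hi - lo) * i / steps
        ok = t0 <= g(theta) <= t1
        if ok and start is None:
            start = theta                  # a run begins
        if not ok and start is not None:
            ranges.append((start, prev))   # a run has just ended
            start = None
        prev = theta
    if start is not None:
        ranges.append((start, prev))
    return ranges


def cubic(theta):
    """Hypothetical cubic with a real maximum and minimum (a > 0)."""
    return theta**3 - 3 * theta


for run in theta_ranges(cubic, -1.0, 1.0, -3.0, 3.0):
    print(run)
```

With these particular values the admissible values of $\theta$ fall into three separate ranges, so that an automatic inversion quoting only the extreme roots would include values of $\theta$ which the probability statement excludes.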


20.13 The point to observe in such cases is that the statements concerning the intervals may still be made with exactitude. The question is, are they still useful and do they solve the problem with which we began, that of specifying a neighbourhood of the parameter value? Shall we, in fact, admit them as confidence intervals or shall we deny this name to them?

No simple answer to such questions has been given, but we may record our own opinion on the subject.

(a) The most satisfactory situation is that in which the confidence lines are monotonic in the sense that an ordinate meets each line only once, the parameter then being asserted to lie inside a connected interval. Further desiderata are that, for fixed $\alpha$, the confidence belt for any $n$ should lie inside the belt for any smaller $n$; and that for fixed $n$, the belt for any $(1-\alpha)$ should lie inside that for larger $(1-\alpha)$. These conditions are obeyed in Examples 20.1 to 20.4.

(b) Where such conditions are not obeyed, the case should be considered on its merits. Instances may arise where a disconnected interval such as that of Fig. 20.4 occurs and is acceptable. Where possible, the confidence regions should be graphed. The automatic "inversion" of probability statements without attention to such points must be avoided.


20.14 We may, at this stage, notice another point of a rather different kind which sometimes leads to difficulty. When considering the quadratic (20.18) we remarked that, under the conditions of the problem, the roots were always real. It may happen, however, that for some confidence coefficients we set ourselves an impossible task in the construction of real intervals. The following example will illustrate the point.


Example 20.5 


If $x_1, \ldots, x_n$ are a sample of $n$ observations from a normal population with unit variance and mean $\mu$, the statistic $\chi^{2} = \sum (x-\mu)^{2}$ is distributed in the chi-squared form with $n$ degrees of freedom. For assigned confidence coefficient $1-\alpha$ we can determine $\chi_0^{2}$, $\chi_1^{2}$ (say as a central interval, to simplify the exposition) such that

$$P\{\chi_0^{2} \leq \chi^{2} \leq \chi_1^{2}\} = 1-\alpha. \tag{20.21}$$

Now if $s^{2} = \sum (x-\bar{x})^{2}/n$ we have the identity

$$\chi^{2} = \sum (x-\mu)^{2} = n\{s^{2} + (\bar{x}-\mu)^{2}\},$$

and hence the limits for $(\bar{x}-\mu)^{2}$ are given by

$$\frac{\chi_0^{2}}{n} - s^{2} \leq (\bar{x}-\mu)^{2} \leq \frac{\chi_1^{2}}{n} - s^{2}. \tag{20.22}$$

Now it may happen that $s^{2}$ is greater than $\chi_1^{2}/n$, in which case (since $\chi_0^{2} < \chi_1^{2}$) the inequality (20.22) asserts that $(\bar{x}-\mu)^{2}$ lies between two negative quantities. What are we to make of such an assertion?

The matter becomes clearer if we again consider a geometrical argument. Since $\chi^{2}$ now depends on two statistics, $s$ and $\bar{x}$ (which are, incidentally, independent), we require three dimensions to represent the position, one for $\mu$ and one each for $s$ and $\bar{x}$. Fig. 20.5 attempts the representation.

The quantity $\chi^{2}$ is constant on the surfaces

$$(\bar{x}-\mu)^{2} + s^{2} = \text{constant}.$$

For fixed $\mu$ (i.e. planes perpendicular to the $\mu$-axis) these surfaces intersect the plane $\mu$ = constant in a circle centred at $(\mu, 0)$. These centres all lie on the line in the $(\mu, \bar{x})$ plane, with equation $\mu = \bar{x}$; and the surfaces of constant $\chi^{2}$ are cylinders with this line as axis. (They are not right circular cylinders; only the sections perpendicular to the $\mu$-axis are circles.)

Moreover, the cylinder for $\chi_1^{2}$ completely encloses that for $\chi_0^{2}$, as illustrated in the diagram. Given now an observed $\bar{x}$, $s$ we draw a line parallel to the $\mu$-axis. If this meets each cylinder in two points, $\mu_{00}$, $\mu_{01}$ for $\chi_0^{2}$ and $\mu_{10}$, $\mu_{11}$ for $\chi_1^{2}$, we assert that $\mu_{00} \leq \mu \leq \mu_{10}$ and $\mu_{01} \leq \mu \leq \mu_{11}$. (There are two intervals corresponding to the ambiguity of sign when we take square roots in (20.22).)

The point of the present example is that the line may not meet the cylinders at all. The roots for $\mu$ of (20.22) are then imaginary. Such a situation cannot arise in, for example, the binomial case of Example 20.2, where every line parallel to the $\varpi$ axis in the range $0 \leq p \leq 1$ must cross the confidence belt. Apart from complications due to inverting inequalities such as we considered in 20.11 to 20.13, this usually happens whenever the parameter $\theta$ has a single sufficient statistic $t$ which is used to set the intervals. But it can fail to happen as soon as we use more than one statistic and go into more than two dimensions, as in the present example.

Fig. 20.5. Confidence cylinders (20.22) (see text)

In such cases, it seems to us, we must recognize that we are being set an impossible task.(*) We require to make an assertion with confidence coefficient $1-\alpha$, using these particular statistics, which is to be valid for all observed $\bar{x}$ and $s$. This cannot be done. It can only be done for certain sets of values of $\bar{x}$ and $s$, those for which the limits of (20.22) are positive. For some specified $\bar{x}$ and $s$ we may be able to lower our confidence level, increase the radii of the cylinders and ensure that the line through $\bar{x}$, $s$ does meet the cylinders. But however low we make it, there may always arise a sample, however infrequently, for which we cannot set bounds to $\mu$ by this method.

In our present example, the remedy is clear. We have chosen the wrong method of setting confidence intervals; in fact, if we use the method of Example 20.1 and set bounds to $\bar{x}-\mu$ from the normal curve, no difficulty arises. $\bar{x}$ is then sufficient for $\mu$. In general, where no single sufficient statistic exists, the difficulty may be unavoidable and must be faced squarely, if not after our own suggested manner, then by some equally explicit interpretation.

(*) As we understand him, Neyman would say that such intervals are not confidence intervals in his sense. The conditions of 20.28 below are violated. Other writers have used the expression for intervals obtained by inverting a probability statement without regard to these conditions.
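The three possibilities in Example 20.5 may be exhibited by a short calculation. The sketch below is ours and assumes $n = 10$ with a central 90 per cent interval, for which the tabulated 5 and 95 per cent points of $\chi^{2}$ with 10 degrees of freedom are 3.940 and 18.307; the values of $\bar{x}$ and $s^{2}$ and the function name are hypothetical.

```python
from math import sqrt


def mu_intervals(xbar, s2, n, chi0, chi1):
    """Intervals for mu implied by
    chi0/n - s2 <= (xbar - mu)^2 <= chi1/n - s2;
    an empty list is returned when no real mu satisfies the inequality."""
    lower = chi0 / n - s2   # lower limit for (xbar - mu)^2
    upper = chi1 / n - s2   # upper limit for (xbar - mu)^2
    if upper < 0:
        return []           # (xbar - mu)^2 would have to be negative
    r1 = sqrt(upper)
    if lower <= 0:
        return [(xbar - r1, xbar + r1)]   # the lower limit imposes no restriction
    r0 = sqrt(lower)
    return [(xbar - r1, xbar - r0), (xbar + r0, xbar + r1)]


n, chi0, chi1 = 10, 3.940, 18.307   # assumed central 90 per cent points, 10 d.f.
for s2 in (0.2, 1.0, 2.0):          # hypothetical values of s^2
    print(s2, mu_intervals(0.0, s2, n, chi0, chi1))
```

For the largest of the three values of $s^{2}$ the list is empty: the line through the observed point misses both cylinders and no statement about $\mu$ can be made by this method.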


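20.15 We revert to the approximation to confidence intervals for large samples discussed in 20.10. If it is considered to be too inaccurate to assume that $\psi$ is normally distributed for the sample size $n$ with which we are concerned, a closer approximation may be derived. In fact, we find the higher moments of $\psi$ and use an expansion of the Cornish-Fisher type (6.25-6). Write, using (17.19),

$$I = E\left\{\left(\frac{\partial \log L}{\partial\theta}\right)^{2}\right\} = -E\left(\frac{\partial^{2} \log L}{\partial\theta^{2}}\right), \tag{20.23}$$

$$J = \frac{\partial \log L}{\partial\theta}. \tag{20.24}$$

From (17.18), under regularity conditions,

$$\kappa_1(J) = 0 \tag{20.25}$$

whence

$$\kappa_2(J) = I. \tag{20.26}$$

We now prove that

$$\kappa_3(J) = 3\frac{\partial I}{\partial\theta} + 2E\left(\frac{\partial^{3} \log L}{\partial\theta^{3}}\right), \tag{20.27}$$

$$\kappa_4(J) = 6\frac{\partial^{2} I}{\partial\theta^{2}} + 8\frac{\partial}{\partial\theta}E\left(\frac{\partial^{3} \log L}{\partial\theta^{3}}\right) - 3E\left(\frac{\partial^{4} \log L}{\partial\theta^{4}}\right) + 3\operatorname{var}\left(\frac{\partial^{2} \log L}{\partial\theta^{2}}\right). \tag{20.28}$$

In fact, differentiating the first expression for $I$ in (20.23), we have

$$\frac{\partial I}{\partial\theta} = 2E\left(\frac{\partial^{2} \log L}{\partial\theta^{2}}\,\frac{\partial \log L}{\partial\theta}\right) + E\left\{\left(\frac{\partial \log L}{\partial\theta}\right)^{3}\right\}, \tag{20.29}$$

and differentiating the second expression,

$$I = -E\left(\frac{\partial^{2} \log L}{\partial\theta^{2}}\right),$$

we have

$$\frac{\partial I}{\partial\theta} = -E\left(\frac{\partial^{3} \log L}{\partial\theta^{3}}\right) - E\left(\frac{\partial^{2} \log L}{\partial\theta^{2}}\,\frac{\partial \log L}{\partial\theta}\right). \tag{20.30}$$

Eliminating $E\{(\partial^{2} \log L/\partial\theta^{2})(\partial \log L/\partial\theta)\}$ from (20.29) and (20.30) we arrive at (20.27).

Differentiating twice both relations for $I$ given by (20.23) and eliminating $E\{(\partial^{2} \log L/\partial\theta^{2})(\partial \log L/\partial\theta)^{2}\}$ we find

$$6\frac{\partial^{2} I}{\partial\theta^{2}} = E\left\{\left(\frac{\partial \log L}{\partial\theta}\right)^{4}\right\} - 8E\left(\frac{\partial^{3} \log L}{\partial\theta^{3}}\,\frac{\partial \log L}{\partial\theta}\right) - 5E\left(\frac{\partial^{4} \log L}{\partial\theta^{4}}\right) - 3E\left\{\left(\frac{\partial^{2} \log L}{\partial\theta^{2}}\right)^{2}\right\}.$$

Using the relation

$$\frac{\partial}{\partial\theta}E\left(\frac{\partial^{3} \log L}{\partial\theta^{3}}\right) = E\left(\frac{\partial^{3} \log L}{\partial\theta^{3}}\,\frac{\partial \log L}{\partial\theta}\right) + E\left(\frac{\partial^{4} \log L}{\partial\theta^{4}}\right)$$

and transferring to cumulants we arrive at (20.28). The formulae are due to Bartlett (1953).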
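Using the first four terms in (6.54) with $l_1 = l_2 = 0$, we then have the statistic

$$T(\theta) = \frac{J}{\sqrt{I}} - \frac{1}{6}\frac{\kappa_3}{I^{3/2}}\left(\frac{J^{2}}{I} - 1\right) + \frac{1}{36}\frac{\kappa_3^{2}}{I^{3}}\left(\frac{4J^{3}}{I^{3/2}} - \frac{7J}{\sqrt{I}}\right) - \frac{1}{24}\frac{\kappa_4}{I^{2}}\left(\frac{J^{3}}{I^{3/2}} - \frac{3J}{\sqrt{I}}\right), \tag{20.31}$$

which is, to the next order of approximation, normal with zero mean and unit variance. The first term is the quantity we have called $\psi$. The corrective terms involve the standardized cumulants of $J$ which are equivalent to the cumulants of $\psi$.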


Example 20.6 


Let us consider the problem of setting confidence limits to the variance of a normal population. The distribution of the sample variance is known to be skew, and we can compare the exact results with those given by the foregoing approximation.

Defining

$$s^{2} = \frac{1}{n}\sum (x-\bar{x})^{2},$$

we know that in samples of $n$ the quantity $s^{2}/\sigma^{2}$ is distributed in the Type III form, or alternatively that $ns^{2}/\sigma^{2}$ is distributed as $\chi^{2}$ with $n-1$ degrees of freedom.

Thus, for a sample of size 10 we have (since the lower and upper 5 per cent points of $\chi^{2}$ with 9 degrees of freedom are 3.3251 and 16.9190)

$$P\left\{3.3251 \leq \frac{10s^{2}}{\sigma^{2}} \leq 16.9190\right\} = 0.90.$$

The inequalities may be inverted to give

$$P\{0.5911\,s^{2} \leq \sigma^{2} \leq 3.007\,s^{2}\} = 0.90. \tag{20.32}$$

For example, with $s^{2} = 1$ the limits are 0.5911 and 3.007.
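We find, taking $\theta = \sigma^{2}$,

$$\frac{\partial \log L}{\partial\theta} = \frac{n}{2\theta^{2}}\left\{\frac{1}{n}\sum (x-\mu)^{2} - \theta\right\}, \tag{20.33}$$

whence we confirm that

$$E\left(\frac{\partial \log L}{\partial\theta}\right) = 0 \tag{20.34}$$

as required. It follows from Example 17.10 that

$$I = \frac{n}{2\theta^{2}} \tag{20.35}$$

whence

$$\frac{\partial I}{\partial\theta} = -\frac{n}{\theta^{3}}. \tag{20.36}$$

Differentiating (20.33) twice, we find on taking expectations

$$E\left(\frac{\partial^{3} \log L}{\partial\theta^{3}}\right) = \frac{2n}{\theta^{3}}. \tag{20.37}$$

Hence, from (20.27), (20.36) and (20.37) we have

$$\kappa_3(J) = \frac{n}{\theta^{3}}. \tag{20.38}$$

We will take the expansion (20.31) as far as $\kappa_3$ only, obtaining

$$T = \frac{J}{\sqrt{I}} - \frac{1}{6}\frac{\kappa_3}{I^{3/2}}\left(\frac{J^{2}}{I} - 1\right) = \sqrt{\frac{n}{2}}\left\{\frac{\sum (x-\mu)^{2}}{n\theta} - 1\right\} - \frac{\sqrt{2}}{3\sqrt{n}}\left[\frac{n}{2}\left\{\frac{\sum (x-\mu)^{2}}{n\theta} - 1\right\}^{2} - 1\right]. \tag{20.39}$$

We shall replace $\sum (x-\mu)^{2}/n$ by $ns^{2}/(n-1)$, which has the same mean value, and obtain

$$T = \sqrt{\frac{n}{2}}\left\{\frac{ns^{2}}{(n-1)\theta} - 1\right\} - \frac{\sqrt{2}}{3\sqrt{n}}\left[\frac{n}{2}\left\{\frac{ns^{2}}{(n-1)\theta} - 1\right\}^{2} - 1\right]. \tag{20.40}$$

The first term gives us the confidence limits for $\theta$ based on $\psi$ alone. The other terms will be corrective terms of lower order in $n$. We then have approximately from (20.40)

$$\frac{s^{2}}{\theta} = \frac{n-1}{n}\left\{1 + T\sqrt{\frac{2}{n}} + \frac{2}{3n}(T^{2}-1)\right\}. \tag{20.41}$$

For example, with $n = 10$, $1-\alpha = 0.90$, $s^{2} = 1$, and $T = \pm 1.6449$ (the 5 per cent points of the standardized normal distribution), we find for the limits of $s^{2}/\theta$ the values 0.3403 to 1.6644 and hence limits for $\theta$ of 0.6008 and 2.939. The true values, as we saw at (20.32) above, are 0.5911 and 3.007. For so low a value as $n = 10$ the approximation seems very fair.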
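The figures of Example 20.6 may be reproduced as follows. The sketch is ours and takes (20.41) in the form $s^{2}/\theta = \{(n-1)/n\}\{1 + T\sqrt{(2/n)} + 2(T^{2}-1)/(3n)\}$, an assumption which agrees with the numerical values quoted in the text; the function names are hypothetical.

```python
from math import sqrt


def exact_limits(s2, n, chi_lower, chi_upper):
    """Limits for sigma^2 from chi_lower <= n s^2 / sigma^2 <= chi_upper."""
    return n * s2 / chi_upper, n * s2 / chi_lower


def approx_limits(s2, n, deviate):
    """Limits for sigma^2 from the assumed form of (20.41),
    evaluated at T = +deviate and T = -deviate."""
    def ratio(T):
        # approximate value of s^2 / theta for a given normal deviate T
        return (n - 1) / n * (1 + T * sqrt(2 / n) + 2 * (T**2 - 1) / (3 * n))
    return s2 / ratio(deviate), s2 / ratio(-deviate)


print(exact_limits(1.0, 10, 3.3251, 16.9190))
print(approx_limits(1.0, 10, 1.6449))
```

The approximate limits are somewhat too narrow at the upper end, as is to be expected when only the $\kappa_3$ correction is retained.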


Shortest sets of confidence intervals 

20.16 It has been seen in Example 20.1 that in some circumstances at least there exist more than one set of confidence intervals, and it is now necessary to consider whether any particular set can be regarded as better than the others in some useful sense. The problem is analogous to that of estimation, where we found that in general there are many different estimators for a parameter, but that we could sometimes find one (such as that with minimum variance) which was superior to the rest.

In Example 20.1 the problem presented itself in rather a specialized form. We found that for the intervals based on the mean $\bar{x}$ there were infinitely many sets of intervals according to the way in which we selected $\alpha_0$ and $\alpha_1$ (subject to the condition that $\alpha_0 + \alpha_1 = \alpha$). Among these the central intervals are obviously the shortest, for a given range will include the greatest area of the normal distribution if it is centred at the mean. We might reasonably say that the central intervals are the best among those determined by $\bar{x}$.

But it does not follow that they are the shortest of all possible intervals, or even that such a shortest set exists. In general, for two sets of intervals $c_1$ and $c_2$, those of $c_1$ may be shorter than those of $c_2$ for some samples and longer in others.
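20.17 We will therefore consider sets of intervals which are shortest on the average. That is to say, if

$$\delta = t_1 - t_0 \tag{20.42}$$

we require to minimize $\int \delta\, dF$, where the integral is taken over all $x$'s and is therefore equivalent to

$$\int \cdots \int \delta\, L\, dx_1 \ldots dx_n. \tag{20.43}$$

We now prove a theorem, due to Wilks (1938b), which is very similar to the result of 18.16 that Maximum Likelihood estimators in the limit have minimum variance, namely that in a certain class of intervals the method of 20.10 gives those which are shortest on the average in large samples.

Let $h(x, \theta)$ be a statistic which has a zero mean value and is such that the sum of a number of similar functions obeys the Central Limit Theorem. That is to say,

$$\zeta = \frac{\sum_{i=1}^{n} h(x_i, \theta)}{\sqrt{(n \operatorname{var} h)}} \tag{20.44}$$

is normally distributed in the limit with zero mean and unit variance. $\psi$ of equation (20.8) is a member of the class $\zeta$, having $h = \partial \log f(x, \theta)/\partial\theta$. We first show that the absolute average rate of change of $\psi$ with respect to $\theta$, for each fixed $\theta$, is greater than that of any $\zeta$ except in the trivial case

$$h = k\,\frac{\partial \log f}{\partial\theta},$$

where $k$ is a constant. Writing $g(x, \theta) = \partial \log f/\partial\theta$, we have

$$\frac{\partial\psi}{\partial\theta} = \frac{1}{\sqrt{(n \operatorname{var} g)}}\sum \frac{\partial g}{\partial\theta} - \frac{\sum g}{2\sqrt{n}\,(\operatorname{var} g)^{3/2}}\,\frac{\partial \operatorname{var} g}{\partial\theta},$$

$$\frac{\partial\zeta}{\partial\theta} = \frac{1}{\sqrt{(n \operatorname{var} h)}}\sum \frac{\partial h}{\partial\theta} - \frac{\sum h}{2\sqrt{n}\,(\operatorname{var} h)^{3/2}}\,\frac{\partial \operatorname{var} h}{\partial\theta}. \tag{20.45}$$

Hence

$$E\left(\frac{\partial\psi}{\partial\theta}\right) = \frac{1}{\sqrt{(n \operatorname{var} g)}}\left\{nE\left(\frac{\partial g}{\partial\theta}\right) - \frac{nE(g)}{2\operatorname{var} g}\,\frac{\partial \operatorname{var} g}{\partial\theta}\right\}. \tag{20.46}$$

Now $E(g) = 0$ by (20.25) and by (20.26)

$$E\left(\frac{\partial g}{\partial\theta}\right) = E\left(\frac{\partial^{2} \log f}{\partial\theta^{2}}\right) = -E\left\{\left(\frac{\partial \log f}{\partial\theta}\right)^{2}\right\} = -\operatorname{var} g.$$

Thus (20.46) becomes

$$E\left(\frac{\partial\psi}{\partial\theta}\right) = -\frac{n \operatorname{var} g}{\sqrt{(n \operatorname{var} g)}} = -\sqrt{(n \operatorname{var} g)}. \tag{20.47}$$

Similarly, since $E(h) = 0$, (20.45) gives

$$E\left(\frac{\partial\zeta}{\partial\theta}\right) = \left(\frac{n}{\operatorname{var} h}\right)^{1/2} E\left(\frac{\partial h}{\partial\theta}\right). \tag{20.48}$$

Since

$$0 = E(h) = \int h f\, dx$$

we have, differentiating under the integral sign,

$$0 = \frac{\partial}{\partial\theta}E(h) = \int \frac{\partial h}{\partial\theta} f\, dx + \int h \frac{\partial f}{\partial\theta}\, dx,$$

whence

$$E\left(\frac{\partial h}{\partial\theta}\right) = -\int h\,\frac{\partial \log f}{\partial\theta}\, f\, dx = -\operatorname{cov}(h, g). \tag{20.49}$$

Hence, from (20.47-20.49)

$$\left|E\left(\frac{\partial\psi}{\partial\theta}\right)\right| - \left|E\left(\frac{\partial\zeta}{\partial\theta}\right)\right| = \left(\frac{n}{\operatorname{var} h}\right)^{1/2}\left\{(\operatorname{var} h \operatorname{var} g)^{1/2} - |\operatorname{cov}(h, g)|\right\}. \tag{20.50}$$

By the Cauchy-Schwarz inequality, the factor in braces in (20.50) is positive unless $h$ is a constant multiple of $g$, when the factor is zero. Excluding this case, we have

$$\left|E\left(\frac{\partial\psi}{\partial\theta}\right)\right| > \left|E\left(\frac{\partial\zeta}{\partial\theta}\right)\right| \tag{20.51}$$

which is our preliminary result. Now if $\lambda_\alpha$ is defined by

$$(2\pi)^{-1/2}\int_{0}^{\lambda_\alpha} \exp(-\tfrac{1}{2}x^{2})\, dx = \tfrac{1}{2}(1-\alpha),$$

the confidence limits for $\theta$, say $t_0$ and $t_1$, obtained from $\psi$ satisfy

$$\sum g(x, \theta)/\sqrt{(n \operatorname{var} g)} = \pm\lambda_\alpha$$

which we may write $\psi(t) = \pm\lambda_\alpha$. Similarly those obtained from $\zeta$, say $u_0$ and $u_1$, satisfy

$$\sum h(x, \theta)/\sqrt{(n \operatorname{var} h)} = \pm\lambda_\alpha$$

which we write $\zeta(u) = \pm\lambda_\alpha$.

Taylor expansions about the true value $\theta_0$ give

$$\pm\lambda_\alpha = \psi(\theta_0) + (t-\theta_0)\left(\frac{\partial\psi}{\partial\theta}\right)_{\theta'} = \zeta(\theta_0) + (u-\theta_0)\left(\frac{\partial\zeta}{\partial\theta}\right)_{\theta''} \tag{20.52}$$

where $\theta'$, $\theta''$ are values in the neighbourhood of $\theta_0$ which converge in probability to $\theta_0$ as $n$ increases. Putting $t = u = \theta_0$ in (20.52), we find $\psi(\theta_0) = \zeta(\theta_0)$. Hence

$$(t-\theta_0)\left(\frac{\partial\psi}{\partial\theta}\right)_{\theta'} = (u-\theta_0)\left(\frac{\partial\zeta}{\partial\theta}\right)_{\theta''}. \tag{20.53}$$

Now the derivatives in (20.53) will converge in probability to their expectations. Hence, from (20.51) and (20.53), we have for large $n$

$$|t-\theta_0| < |u-\theta_0|,$$

so that the confidence limits $t_0$, $t_1$ are closer together, on the average, than any other limits $u_0$, $u_1$ given by a member of the class (20.44).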
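The size of the effect can be gauged in a simple case. Since $|E(\partial\psi/\partial\theta)| = \sqrt{(n \operatorname{var} g)}$ and $|E(\partial\zeta/\partial\theta)| = (n/\operatorname{var} h)^{1/2}|\operatorname{cov}(h, g)|$, the ratio of the two slopes is $|\operatorname{cov}(h, g)|/(\operatorname{var} h \operatorname{var} g)^{1/2}$. The sketch below, which is ours, evaluates this ratio by numerical integration for a normal population with unit variance, taking the hypothetical alternative $h = (x-\theta)^{3}$; the integration range and step are arbitrary choices.

```python
from math import exp, pi, sqrt


def normal_expectation(func, lo=-12.0, hi=12.0, steps=24000):
    """Trapezoidal approximation to E[func(x)] for a standard normal x."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * width
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * func(x) * exp(-0.5 * x * x) / sqrt(2 * pi)
    return total * width


def g(x):
    """Score d log f / d theta for N(theta, 1), evaluated at theta = 0."""
    return x


def h(x):
    """Hypothetical alternative with zero mean: (x - theta)^3 at theta = 0."""
    return x**3


var_g = normal_expectation(lambda x: g(x) ** 2)
var_h = normal_expectation(lambda x: h(x) ** 2)
cov_hg = normal_expectation(lambda x: h(x) * g(x))

# Ratio of the absolute mean slope of zeta to that of psi.
slope_ratio = abs(cov_hg) / sqrt(var_h * var_g)
print(slope_ratio)
```

A ratio less than unity means that, in large samples, the limits obtained from $\zeta$ lie farther from the true value than those obtained from $\psi$, in about the inverse of that ratio.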


20.18 The result of 20.17 illustrates the close relation between the theory of confidence intervals and that of point estimation developed in Chapter 17. In 17.15-17 we showed that the MVB for the estimation of $\theta$ was equal to $1/E\{(\partial \log L/\partial\theta)^{2}\}$ and could be attained only by an estimator which was a linear function of $\partial \log L/\partial\theta$. It is natural to expect that interval estimates of $\theta$ based on $\partial \log L/\partial\theta$ should have the corresponding property of being shortest on the average. We have now seen that this is so in large samples.
20.19 Neyman (1937b) proposed to apply the phrase "shortest confidence intervals" to sets of intervals defined in quite a different way. As it is clear that such intervals are not necessarily the shortest in the sense of possessing the least length, even on the average, we shall attempt to avoid confusion by calling them "most selective."

Consider a set of intervals $S_0$, typified by $\delta_0$, obeying the condition that

$$P\{\delta_0 \;\mathrm{c}\; \theta \mid \theta\} = 1-\alpha, \tag{20.54}$$

where we write $\delta_0 \;\mathrm{c}\; \theta$ (that is, $\delta_0$ "contains" $\theta$) for the more usual $t_0 \leq \theta \leq t_1$. Let $S_1$ be some other set, typified by $\delta_1$, such that

$$P\{\delta_1 \;\mathrm{c}\; \theta \mid \theta\} = 1-\alpha. \tag{20.55}$$

Either set is a permissible set of intervals, as the probability is $1-\alpha$ in both cases that the interval $\delta$ contains $\theta$.

If now for every $S_1$ we have, for any value $\theta'$ other than the true value,

$$P\{\delta_0 \;\mathrm{c}\; \theta' \mid \theta\} \leq P\{\delta_1 \;\mathrm{c}\; \theta' \mid \theta\}, \tag{20.56}$$

$S_0$ is said to be most selective.


20.20 The ideas underlying this definition will be clearer from a reading of Chapters 22 and 23 dealing with the theory of tests. We anticipate them here to the extent of remarking that the object of most selective intervals is to cover the true value with assigned probability $1-\alpha$, but to cover other values as little as possible. We may say of both $S_0$ and $S_1$ that the assertion $\delta \;\mathrm{c}\; \theta$ is true in proportion $1-\alpha$ of the cases. What marks out $S_0$ for choice as the most selective set is that it covers false values less frequently than the remaining sets.

The difference between this approach and the one leading to shortest intervals is that the latter is concerned only with the physical length of the confidence interval, whereas the former gives weight to the frequency with which alternative values of $\theta$ are covered. The one concentrates on locating the true value $\theta$ with the smallest margin of error; the other takes into account the desirability of excluding so far as possible false values of $\theta$ from the interval, so that mistakes of taking the wrong value are minimized. It turns out that the "selectivity" approach is easier to handle mathematically, so that much more attention has been given to it. See Exercises 20.17-18 below for a relationship between the two approaches. Madansky (1962) gives an example which illuminates the difference between them. Harter (1964) discusses other criteria for intervals, preferring one based on the mean squared deviation of the confidence limits from the true parameter value.

Neyman himself has shown that most selective sets do not usually exist (for instance, if the distribution is continuous) and has proposed two alternative systems:

(a) most selective one-sided systems (Neyman's "shortest one-sided" sets) which obey (20.56) only for values of $\theta' - \theta$ which are always positive or always negative;

(b) selective unbiassed systems (Neyman's "short unbiassed" sets) which obey the further relation

$$P\{\delta \;\mathrm{c}\; \theta \mid \theta\} = 1-\alpha \geq P\{\delta \;\mathrm{c}\; \theta \mid \theta'\}. \tag{20.57}$$

These definitions, also, amount to a translation into terms of confidence intervals of certain ideas in the theory of tests, and we may defer consideration of them until Chapter 23. We therefore need make no systematic study of "optimum" confidence intervals in this chapter.
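The distinction between covering the true value and covering false values may be made concrete for the mean of a normal population with unit variance, using a single observation $x$. The sketch below is ours; it assumes intervals of the form $[x - a, x + b]$, and the function name and the particular division of $\alpha$ between the two tails are hypothetical.

```python
from statistics import NormalDist

STANDARD = NormalDist()


def cover_probability(d, alpha_lower, alpha_upper):
    """P{interval covers theta + d} when x is N(theta, 1).

    The interval is [x - a, x + b], with a and b chosen so that the true
    value falls below the interval with probability alpha_lower and above
    it with probability alpha_upper.
    """
    a = STANDARD.inv_cdf(1 - alpha_lower)
    b = STANDARD.inv_cdf(1 - alpha_upper)
    return STANDARD.cdf(d + a) - STANDARD.cdf(d - b)


for d in (-1.0, -0.3, 0.0, 0.3, 1.0):
    central = cover_probability(d, 0.025, 0.025)
    skewed = cover_probability(d, 0.01, 0.04)
    print(d, round(central, 4), round(skewed, 4))
```

Both systems cover the true value ($d = 0$) with probability 0.95, but only the central system never covers a false value more often than the true one; the non-central system does so for small negative $d$, and so fails the requirement of the type (20.57).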


20.21 Tables and charts of confidence intervals 

(1) Binomial distribution: Clopper and Pearson (1934) give two central confidence interval charts for the parameter, for $\alpha$ = 0.01 and 0.05; each gives contours for $n$ = 8 (2) 12 (4) 24, 30, 40, 60, 100, 200, 400 and 1000. The charts are reproduced in the Biometrika Tables. Upper or lower one-sided intervals can be obtained for half these values of $\alpha$. Incomplete B-function tables may also be used; see 5.7 and the Biometrika Tables.

Pachares (1960) gives central limits for $\alpha$ = 0.01, 0.02, 0.05, 0.10 and $n$ = 55 (5) 100, and references to other tables, including those of Clark (1953) for the same values of $\alpha$ and $n$ = 10 (1) 50.

Sterne (1954) has proposed an alternative method of setting confidence limits for a proportion. Instead of being central, the belt contains the values of $p$ with the largest probabilities of occurrence. Since the distribution of $p$ is skew in general, we clearly shorten the interval in this way. Crow (1956) has shown that these intervals constitute a confidence belt with minimum total area, and has tabulated a slightly modified set of intervals for sample sizes up to 30 and confidence coefficients 0.90, 0.95 and 0.99. See also 20.23 below.

(2) Poisson distribution: (a) The Biometrika Tables, using the work of Garwood (1936), give central confidence intervals for the parameter, for observed values $x$ = 0 (1) 30 (5) 50 and $\alpha$ = 0.002, 0.01, 0.02, 0.05, 0.10. As in (1), one-sided intervals are available for $\alpha/2$. (b) Ricker (1937) gives similar tables for $x$ = 0 (1) 50 and $\alpha$ = 0.01, 0.05. (c) Przyborowski and Wilenski (1935) give upper confidence limits only for $x$ = 0 (1) 50, $\alpha$ = 0.001, 0.005, 0.01, 0.02, 0.05, 0.10. (d) Crow and Gardner (1959) tabulate modified intervals of the Sterne-Crow binomial type for $x$ = 0 (1) 300 and $\alpha$ = 0.001, 0.01, 0.05, 0.10, 0.20. See also 20.23 below.

(3) Variance of a normal distribution: Tate and Klett (1959) give the most selective unbiassed confidence intervals, and the physically shortest intervals, based on multiples of the sufficient statistic for $\alpha$ = 0.001, 0.005, 0.01, 0.05, 0.10 and $n$ = 3 (1) 30. The former are also given by Pachares (1961) for $\alpha$ = 0.01, 0.05, 0.10 and $n-1$ = 1 (1) 20, 24, 30, 40, 60, 120; and by Lindley et al. (1960) for $\alpha$ = 0.001, 0.01, 0.05 and $n-1$ = 1 (1) 100.

(4) Ratio of normal variances: Ramachandran (1958) gives the most selective unbiassed intervals for $\alpha$ = 0.05 and $n_1-1$, $n_2-1$ = 2 (1) 4 (2) 12 (4) 24, 30, 40, 60.

(5) Correlation parameter: F. N. David (1938) gives four central confidence interval charts for the correlation parameter $\rho$ of a bivariate normal population, for $\alpha$ = 0.01, 0.02, 0.05, 0.10; each gives contours for $n$ = 3 (1) 8, 10, 12, 15, 20, 25, 50, 100, 200 and 400. The Biometrika Tables reproduce the $\alpha$ = 0.01 and $\alpha$ = 0.05 charts. One-sided intervals may be obtained as in (1).


Discontinuities 

20.22 In discussing the binomial distribution in Example 20.2, we remarked on the fact that, as the number of successes (say $c$) is necessarily integral, and the proportion of successes $p$ ($= c/n$) therefore discontinuous, the confidence belt obtained is not exact, but provides confidence statements of form $P \geq 1-\alpha$ instead of $P = 1-\alpha$. By a rather peculiar device, we can always make exact statements of form $P = 1-\alpha$ even in the presence of discontinuity. The method was given by Stevens (1950). In fact, after we have drawn our sample and observed $c$ successes, let us from elsewhere draw a random number $x$ from the rectangular distribution $dF = dx$, $0 \leq x \leq 1$, e.g. by selecting a random number of four digits from the usual tables and putting a decimal point in front. Then the variate

$$y = c + x \tag{20.58}$$

can take all values in the range 0 to $n+1$ (assuming that four decimal places is enough to specify a continuous variate). If $y_0$ is some given value $c_0 + x_0$, we have, writing $\varpi$ for the probability to be estimated,

$$P(y \geq y_0) = P(c > c_0) + P(c = c_0)\,P(x \geq x_0)$$

$$= \sum_{j=c_0+1}^{n} \binom{n}{j}\varpi^{j}(1-\varpi)^{n-j} + \binom{n}{c_0}\varpi^{c_0}(1-\varpi)^{n-c_0}(1-x_0)$$

$$= x_0 \sum_{j=c_0+1}^{n} \binom{n}{j}\varpi^{j}(1-\varpi)^{n-j} + (1-x_0)\sum_{j=c_0}^{n} \binom{n}{j}\varpi^{j}(1-\varpi)^{n-j}. \tag{20.59}$$

This defines a continuous probability distribution for $y$. It is clearly continuous as $x_0$ moves from $0+$ to $1-$, for $c_0$ is then constant. And at the points where $x_0 = 0$ the probability approaches the same value from the left and from the right. We can therefore use this distribution to set confidence limits for $\varpi$ and our confidence statements based upon them will be exact statements of form $P = 1-\alpha$.

The confidence intervals are of the type exhibited in Fig. 20.6. The upper limit is now shifted to the right by amounts which, in effect, join up the discontinuities by a series of arcs. The lower limit also has a series of arcs, but there is no displacement to the right, and we have therefore shown on the diagram only the (dotted) approximate upper limit of Fig. 20.2. On our scale the lower approximate limit would almost coincide with the lower series of arcs. The general effect is to shorten the confidence interval.

Fig. 20.6. Randomized confidence intervals for a binomial parameter (ordinate: values of $\varpi$, from 0 to 1; abscissa: number of successes plus random element, from 0 to 20)
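The randomization device can be tried out directly. In the sketch below, which is ours, `upper_tail` evaluates $P(y \geq y_0)$ as the sum of $P(c > c_0)$ and $(1 - x_0)P(c = c_0)$, and central limits for $\varpi$ are found by bisection on the assumption that this probability increases steadily with $\varpi$. The function names, the value $y_0 = 5.5$ and the sample size 20 are hypothetical.

```python
from math import comb, floor


def upper_tail(y0, n, w):
    """P(y >= y0), where y = c + x, y0 = c0 + x0, c is binomial (n, w)
    and x is rectangular on (0, 1)."""
    c0 = floor(y0)
    x0 = y0 - c0

    def pmf(j):
        return comb(n, j) * w**j * (1 - w) ** (n - j)

    above = sum(pmf(j) for j in range(c0 + 1, n + 1))   # P(c > c0)
    at = pmf(c0) if c0 <= n else 0.0                    # P(c = c0)
    return above + at * (1 - x0)


def solve_w(y0, n, target, tol=1e-10):
    """Bisection for the w at which upper_tail(y0, n, w) equals target."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if upper_tail(y0, n, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def randomized_limits(y0, n, alpha=0.05):
    """Central limits for w: tail probabilities alpha/2 and 1 - alpha/2."""
    return solve_w(y0, n, alpha / 2), solve_w(y0, n, 1 - alpha / 2)


print(randomized_limits(5.5, 20))   # 5 successes and random element 0.5
```

For 5 successes in 20 the limits so obtained lie inside those of the non-randomized central interval, which is the shortening effect described in the text.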


20.23 It is at first sight surprising that the intervals set up in this way lie inside the approximate step-intervals of Fig. 20.2, and are therefore no less accurate; for by taking an additional random number $x$ we have imported additional uncertainty into the situation. A little reflection will show, however, that we have not got something for nothing. We have removed one uncertainty, associated with the inequality in $P \geq 1-\alpha$, by bringing in another so as to make statements of the kind $P = 1-\alpha$; and what we lose on the second we more than offset by removing the first.

Central intervals for the binomial parameter may easily be derived by use of (20.59), but they are not the most selective unbiassed randomized intervals. The latter are tabulated by Blyth and Hutchinson (1960) for $\alpha$ = 0.01, 0.05, and $n$ = 2 (1) 24 (2) 50. The same authors (1961) give intervals with the same property for the Poisson parameter, for observed $x$ ranging to 250 and $\alpha$ = 0.01, 0.05.


Generalization to the case of several parameters 

20.24 We now proceed to generalize the foregoing theory to the case of a distribution
dependent upon several parameters. Although, to simplify the exposition, we
shall deal in detail only with a single variate, the theory is quite general. We begin
by extending our notation and introducing a geometrical terminology which may be
regarded as an elaboration of the diagrams of Fig. 20.1 and 20.2.

Suppose we have a frequency function of known form depending on $l$ unknown
parameters, $\theta_1, \ldots, \theta_l$, and denoted by $f(x, \theta_1, \ldots, \theta_l)$. We may require to estimate
either $\theta_1$ only or several of the $\theta$'s simultaneously. In the first place we consider only
the estimation of a single parameter. To determine confidence limits we require to find
two functions, $u_0$ and $u_1$, dependent on the sample values but not on the $\theta$'s, such that
$$P\{u_0 \le \theta_1 \le u_1\} = 1-\alpha, \tag{20.60}$$
where $1-\alpha$ is the confidence coefficient chosen in advance.

With a sample of $n$ values, $x_1, \ldots, x_n$, we can associate a point in an $n$-dimensional
Euclidean space, and the frequency distribution will determine a density function for
each such point. The quantities $u_0$ and $u_1$, being functions of the $x$'s, are determined
in this space, and for any given $1-\alpha$ will lie on two hypersurfaces (the natural extension
of the confidence lines of Fig. 20.1). Between them will lie a Confidence Zone.

In general we also have to consider a range of values of $\theta$ which are a priori possible.
There will thus be an $l$-dimensional space of $\theta$'s subjoined to the $n$-space, the total
region of variation having $(l+n)$ dimensions; but if we are considering only the
estimation of $\theta_1$, this reduces to an $(n+1)$-space, the other $(l-1)$ parameters not
appearing.

We shall call the sample-space $W$ and denote a point whose co-ordinates are
$x_1, \ldots, x_n$ by $E$. We may then write $u_0(E)$, $u_1(E)$ to show that the confidence functions
depend on $E$. The interval $u_1(E) - u_0(E)$ we denote by $\delta(E)$ or $\delta$, and as above we
write $\delta\ \mathrm{c}\ \theta_1$ to denote $u_0 \le \theta_1 \le u_1$. The confidence zone we denote by $A$, and may
write $E \in \delta$ or $E \in A$ to indicate that the sample-point lies in the interval $\delta$ or the
region $A$.
20.25 In Fig. 20.7 we have shown two axes, $x_1$ and $x_2$, and a third axis corresponding
to the variation of $\theta_1$. The sample-space $W$ is thus two-dimensional. For any
given $\theta_1$, say $\theta_1'$, the space $W$ is a hyperplane (or part of it), one such being shown.

Take any given pair of values $(x_1, x_2)$ and draw through the point so defined a line
parallel to the $\theta_1$-axis, such as $PQ$ in the figure, cutting the hyperplane at $R$. The
two values of $u_0$ and $u_1$ will give two limits to $\theta_1$ corresponding to two points on this
line, say $U$, $V$. Consider now the lines $PQ$ as $x_1$, $x_2$ vary. In some cases $U$, $V$ will
lie on opposite sides of $R$, and $\theta_1'$ lies inside the interval $UV$. In other cases (as for
instance in $U'V'$ shown in the figure), the contrary is true. The totality of points in
the former category determines the region $A$, shaded in the figure. If for any point
in $A$ we assert $\delta\ \mathrm{c}\ \theta_1'$, we shall be right; if we assert it for points outside $A$ we shall be
wrong.

[Fig. 20.7. Confidence intervals for $n = 2$ (see text)]


20.26 Evidently, if the sample-point $E$ falls in the region $A$, the corresponding $\theta_1'$
lies in the confidence interval, and conversely. It follows that the probability of any
fixed $\theta_1'$ being covered by the confidence interval is the probability that $E$ lies in $A(\theta_1')$;
or in symbols
$$P\{\delta\ \mathrm{c}\ \theta_1' \mid \theta_1', \theta_2, \ldots, \theta_l\} = P\{u_0 \le \theta_1' \le u_1 \mid \theta_1', \ldots, \theta_l\} = P\{E \in A(\theta_1') \mid \theta_1', \ldots, \theta_l\}. \tag{20.61}$$
From this it follows that if the confidence functions are determined so that
$$P\{u_0 \le \theta_1 \le u_1\} = 1-\alpha$$
we shall have, for all $\theta_1$,
$$P\{E \in A(\theta_1) \mid \theta_1, \ldots, \theta_l\} = 1-\alpha. \tag{20.62}$$
It follows also that for no $\theta_1$ can the region $A$ be empty, for if it were the probability in
(20.62) would be zero.


20.27 If the functions $u_0$ and $u_1$ are single-valued and determined for all $E$, then
any sample-point will fall into at least one region $A(\theta_1)$. For on the line $PQ$ corresponding
to the given $E$ we take an $R$ between $U$ and $V$, and this will define a value of
$\theta_1$, say $\theta_1'$, such that $E \in A(\theta_1')$.

More importantly, if a sample-point falls in the regions $A(\theta_1')$ and $A(\theta_1'')$ corresponding
to two values of $\theta_1$, $\theta_1'$ and $\theta_1''$, it will fall in the region $A(\theta_1''')$, where $\theta_1'''$ is any
value between $\theta_1'$ and $\theta_1''$. For we have
$$u_0 \le \theta_1' \le u_1, \qquad u_0 \le \theta_1'' \le u_1,$$
and hence
$$u_0 \le \theta_1' \le \theta_1''' \le \theta_1'' \le u_1$$
if $\theta_1''$ is the greater of $\theta_1'$ and $\theta_1''$.
Further, if a sample-point falls in any of the regions $A(\theta_1)$ for the range of $\theta$-values
$\theta_1' < \theta_1 < \theta_1''$ it must also fall within $A(\theta_1')$ and $A(\theta_1'')$.


20.28 The conditions referred to in the two previous sections are necessary. We
now prove that they are sufficient, that is to say: if for each value of $\theta_1$ there is defined
in the sample-space $W$ a region $A$ such that

(1) $P\{E \in A(\theta_1) \mid \theta_1\} = 1-\alpha$, whatever the value of the $\theta$'s;

(2) for any $E$ there is at least one $\theta_1$, say $\theta_1'$, such that $E \in A(\theta_1')$;

(3) if $E \in A(\theta_1')$ and $E \in A(\theta_1'')$, then $E \in A(\theta_1''')$ for any $\theta_1'''$ between $\theta_1'$ and $\theta_1''$;

(4) if $E \in A(\theta_1)$ for any $\theta_1$ satisfying $\theta_1' < \theta_1 < \theta_1''$, $E \in A(\theta_1')$ and $E \in A(\theta_1'')$;

then confidence limits for $\theta_1$, $u_0$ and $u_1$, are given by taking the lower and upper bounds
of values of $\theta_1$ for which a fixed sample-point falls within $A(\theta_1)$. They are determinate
and single-valued for all $E$, $u_0 \le u_1$, and $P\{u_0 \le \theta_1 \le u_1 \mid \theta_1\} = 1-\alpha$ for all $\theta_1$.

The lower and upper bounds exist in virtue of condition (2), and the lower is not
greater than the upper. We have then merely to show that $P\{u_0 \le \theta_1 \le u_1 \mid \theta_1\} = 1-\alpha$
and for this it is sufficient, in virtue of condition (1), to show that
$$P\{u_0 \le \theta_1' \le u_1 \mid \theta_1'\} = P\{E \in A(\theta_1') \mid \theta_1'\}. \tag{20.63}$$
We already know that if $E \in A(\theta_1')$ then $u_0 \le \theta_1' \le u_1$; and our result will be established
if we demonstrate the converse.

Suppose it is not true that when $u_0 \le \theta_1' \le u_1$, $E \in A(\theta_1')$. Let $E'$ be a point
outside $A(\theta_1')$ for which $u_0 \le \theta_1' \le u_1$. Then either $u_0 = \theta_1'$, or $u_1 = \theta_1'$, or both:
for otherwise, $u_0$ and $u_1$ being the bounds of the values of $\theta_1$ for which $E$ lies in $A(\theta_1)$,
there would exist values $\theta_1''$ and $\theta_1'''$, such that $E \in A(\theta_1'')$ and $E \in A(\theta_1''')$ and
$$u_0 \le \theta_1'' < \theta_1' < \theta_1''' \le u_1,$$
so that, from condition (3), $E \in A(\theta_1')$, which is contrary to assumption.
Thus $u_0 = \theta_1'$ or $u_1 = \theta_1'$ or both. If both, then $E$ must fall in $A(\theta_1')$, for $u_0$ and $u_1$
are the bounds of $\theta$-values for which this is so. Finally, if $u_0 = \theta_1' < u_1$ (and similarly
if $u_0 < \theta_1' = u_1$) we see that for $u_0 < \theta_1 < u_1$, $E$ must fall in $A(\theta_1)$ from condition (3),
and hence, from condition (4), $E$ must fall in $A(\theta_1'')$ and $A(\theta_1''')$ where $\theta_1'' = u_0$ and
$\theta_1''' = u_1$. Hence it falls in $A(\theta_1')$.
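The rule stated in 20.28, that the limits are the lower and upper bounds of the parameter values whose region contains the observed sample-point, can be sketched for the simplest case. The block below assumes a normal population with unit variance and regions of the form "sample mean within $z/\sqrt{n}$ of the parameter"; the grid search, the data and the function names are illustrative assumptions only.

```python
import math

Z_975 = 1.959964  # upper 2.5 per cent point of the standard normal distribution

def in_region(sample, theta, z=Z_975):
    """E in A(theta), taking A(theta) = {|mean - theta| <= z / sqrt(n)}."""
    n = len(sample)
    mean = sum(sample) / n
    return abs(mean - theta) <= z / math.sqrt(n)

def limits_from_regions(sample, grid):
    """u0, u1: bounds of the grid values of theta whose region contains E."""
    covered = [theta for theta in grid if in_region(sample, theta)]
    if not covered:
        raise ValueError("grid too coarse or too narrow for this sample")
    return min(covered), max(covered)

sample = [0.3, -1.1, 0.8, 1.9, 0.4, -0.2, 1.3, 0.6, 0.0]   # made-up data, n = 9
grid = [i / 1000 for i in range(-3000, 3001)]               # trial values of theta
u0, u1 = limits_from_regions(sample, grid)
print(u0, u1)
```

Up to the grid spacing, the two bounds reproduce the familiar limits, sample mean $\mp z/\sqrt{n}$, and every grid value between them is covered, as condition (3) requires.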


Choice of statistic 

20.29 The foregoing theorem gives us a formal solution of the problem of finding
confidence intervals for a single parameter in the general case, but it does not provide a
method of finding the intervals in particular instances. In practice we have four lines
of approach: (1) to use a single sufficient statistic if one exists; (2) to adopt the process
known as "studentization" (cf. 20.31); (3) to "guess" a set of intervals in the light
of general knowledge and experience and to verify that they do or do not satisfy the
required conditions; and (4) to approximate by an extension of the method of 20.15.


20.30 Consider the use of a single sufficient statistic in the general case. If $t_1$ is
sufficient for $\theta_1$, we have
$$L = g(t_1 \mid \theta_1)\, L_2(x_1, \ldots, x_n, \theta_2, \ldots, \theta_l). \tag{20.64}$$
The locus $t_1 = \text{constant}$ determines a series of hypersurfaces in the sample-space $W$.
If we regard these hypersurfaces as determining regions in $W$, then $t_1 \le k$, say, determines
a fixed region $K$. The probability that $E$ falls in $K$ is then clearly dependent
only on $t_1$ and $\theta_1$. By appropriate choice of $k$ we can determine $K$ so that
$$P\{E \in K \mid \theta_1\} = 1-\alpha$$
and hence set up regions based on values of $t_1$. We can do so, moreover, in an infinity
of ways, according to the values selected for $\alpha_0$ and $\alpha_1$. We shall see in 23.3, when
discussing this problem in terms of testing hypotheses, that the most selective intervals
(equivalent to the most powerful test of $\theta_1 = \theta_1^0$) are always obtainable from the sufficient
statistics.


Studentization 

20.31 In Example 20.1 we considered a simplified problem of estimating the mean
in samples from a normal population with known variance. Suppose now that we
require to determine confidence limits for the mean $\mu$ in samples from
$$dF = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right\} dx$$
when $\sigma$ is unknown.

Consider the distribution of $z = (\bar{x}-\mu)/s$, where $s^2$ is the sample variance. This
is known to be the "Student" form
$$dF = \frac{k\,dz}{(1+z^2)^{n/2}}$$
(cf. Example 11.8). Given $\alpha$, we can now find $z_0$ and $z_1$, such that
$$\int_{-\infty}^{-z_1} dF = \int_{z_0}^{\infty} dF = \tfrac{1}{2}\alpha, \tag{20.65}$$
and hence
$$P(-z_1 \le z \le z_0) = 1-\alpha,$$
which is equivalent to
$$P(\bar{x} - s z_0 \le \mu \le \bar{x} + s z_1) = 1-\alpha. \tag{20.66}$$
Hence we may say that $\mu$ lies in the range $\bar{x} - s z_0$ to $\bar{x} + s z_1$ with confidence coefficient
$1-\alpha$, the range now being independent of either $\mu$ or $\sigma$. In fact, owing to the symmetry
of "Student's" distribution, $z_0 = z_1$, but this is an accidental circumstance not necessary
to the argument.

It should be noted that (20.66), like (20.4), is linear in the statistic $\bar{x}$; the confidence
lines in this case also are parallel straight lines as in Fig. 20.1. The difference is that
whereas, with $\sigma$ known, the vertical distance between the confidence lines is fixed as a
function of $\sigma$, in the present case the distance is a random variable, being a function of $s$.
Thus we cannot here fix the length of the confidence interval in advance of taking the
observations.
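As a rough numerical companion to (20.65) and (20.66), the sketch below finds $z_0$ without tables. It assumes the standard normalizing constant $k = \Gamma(\tfrac{1}{2}n) / \{\sqrt{\pi}\,\Gamma(\tfrac{1}{2}(n-1))\}$, takes $s^2$ with divisor $n$, and uses the substitution $z = \tan\phi$ so that the integral runs over a bounded range. The substitution, Simpson's rule and all function names are choices made for this illustration, not part of the text.

```python
import math

def half_probability(phi, n, panels=2000):
    """P(0 <= z <= tan(phi)) = k * integral_0^phi cos^(n-2)(u) du, by Simpson's rule."""
    k = math.gamma(n / 2) / (math.sqrt(math.pi) * math.gamma((n - 1) / 2))
    h = phi / panels
    total = 1.0 + math.cos(phi) ** (n - 2)          # end-point ordinates
    for i in range(1, panels):
        weight = 4 if i % 2 else 2
        total += weight * math.cos(i * h) ** (n - 2)
    return k * total * h / 3

def z_point(n, alpha=0.05):
    """z0 such that the upper tail of the distribution of z is alpha / 2."""
    lo, hi = 0.0, math.pi / 2
    for _ in range(60):                              # bisection on phi
        mid = 0.5 * (lo + hi)
        if half_probability(mid, n) < 0.5 - alpha / 2:
            lo = mid
        else:
            hi = mid
    return math.tan(0.5 * (lo + hi))

def student_limits(sample, alpha=0.05):
    """Limits mean -/+ s * z0 of (20.66), with z0 = z1 by symmetry."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in sample) / n)   # divisor n
    z0 = z_point(n, alpha)
    return mean - s * z0, mean + s * z0

print(student_limits([4.1, 5.3, 3.8, 6.0, 5.1]))   # made-up observations
```

Since $z = t/\sqrt{n-1}$ in terms of the usual tabulated $t$, the value returned by `z_point(n)` multiplied by $\sqrt{n-1}$ can be compared with a table of "Student's" $t$ for $n-1$ degrees of freedom.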


20.32 The possibility of finding confidence intervals in this case arose from our
being able to find a statistic $z$, depending only on the parameter under estimate, whose
distribution did not contain $\sigma$. A scale parameter can often be eliminated in this way,
although the resulting distributions are not always easy to handle. If, for instance,
we have a statistic $t$ which is of degree $p$ in the variables, then $t/s^p$ is of degree zero,
and its distribution must be independent of the scale parameter. When a statistic
is reduced to independence of the scale in this way it is said to be "studentized,"
after "Student" (W. S. Gosset), who was the first to perceive the significance of
the process.
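A small check of the degree-zero property may help. The sketch below takes for $t$ the fourth central moment, which is of degree 4 in the variables, and confirms that $t/s^4$ is unchanged when every observation is multiplied by a constant. The data and the helper names are invented for the illustration.

```python
def central_moment(sample, order):
    """Sample central moment of the given order (divisor n)."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((v - mean) ** order for v in sample) / n

def studentized(sample, p):
    """t / s^p, where t is the p-th central moment and s^2 the sample variance."""
    s = central_moment(sample, 2) ** 0.5
    return central_moment(sample, p) / s ** p

data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3]     # made-up observations
rescaled = [7.5 * v for v in data]        # the same data in other units
print(studentized(data, 4), studentized(rescaled, 4))
```

The two printed values agree up to rounding error, so the distribution of the studentized statistic cannot involve the scale parameter.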


20.33 It is interesting to consider the relation between the studentized mean-statistic
and confidence zones based on sufficient statistics in the normal case. The
joint distribution of mean and variance in normal samples is (Example 11.7)
$$dF = \left(\frac{n}{2\pi\sigma^2}\right)^{1/2} \exp\left\{-\frac{n(\bar{x}-\mu)^2}{2\sigma^2}\right\} d\bar{x} \cdot \frac{(n/2\sigma^2)^{(n-1)/2}}{\Gamma\{\tfrac{1}{2}(n-1)\}}\, s^{n-3} \exp\left\{-\frac{n s^2}{2\sigma^2}\right\} ds^2 \tag{20.67}$$
and $\bar{x}$, $s$ are jointly sufficient (Example 17.17). In the sample space $W$ the regions of
constant $\bar{x}$ are hyperplanes and those of constant $s$ are hyperspheres. If we fix $\bar{x}$ and $s$
the sample-point $E$ lies on a hypersphere of $(n-2)$ dimensions (Example 11.7). Choose
a region on this hypersphere of content $1-\alpha$. Then the confidence zone $A$ will be
obtained by combining all such areas for all $\bar{x}$ and $s$.

One such region is seen to be the "slice" of the sample-space obtained by rotating
the hyperplane passing through the origin and the point $(1, 1, \ldots, 1)$ through an angle
$\pi(1-\alpha)$ (not $2\pi(1-\alpha)$ because a half-turn of the plane covers the whole space).

The situation is illustrated for $n = 2$ in Fig. 20.8.

For any given $\mu'$ the axis of rotation meets the hyperplane $\mu = \mu'$ in the point
$x_1 = x_2 = \mu'$, and the hypercones $(\bar{x}-\mu)/s = \text{constant}$ in the $W$ space become the
plane areas between two straight lines (shaded in the figure). A set of regions $A$ is
obtained by rotating a plane about the line $x_1 = x_2 = \mu$ through an angle so as to cut
off in any plane $\mu = \mu'$ an angle $\tfrac{1}{2}\pi(1-\alpha)$ on each side of
$$x_1 - \mu' = -(x_2 - \mu').$$

[Fig. 20.8. Confidence intervals based on "Student's" $t$ for $n = 2$ (see text)]

The boundary planes are given by
$$x_1 - \mu = (x_2 - \mu)\tan(\tfrac{1}{4}\pi - \tfrac{1}{2}\beta),$$
$$x_1 - \mu = (x_2 - \mu)\tan(\tfrac{1}{4}\pi + \tfrac{1}{2}\beta),$$
where $\beta = \pi\alpha$; or, after a little reduction,
$$\mu = \tfrac{1}{2}(x_1 + x_2) + \tfrac{1}{2}(x_1 - x_2)\cot\tfrac{1}{2}\beta,$$
$$\mu = \tfrac{1}{2}(x_1 + x_2) - \tfrac{1}{2}(x_1 - x_2)\cot\tfrac{1}{2}\beta.$$
$\mu$ then lies in the region of acceptance if
$$\tfrac{1}{2}(x_1 + x_2) - \tfrac{1}{2}|x_1 - x_2|\cot\tfrac{1}{2}\beta \le \mu \le \tfrac{1}{2}(x_1 + x_2) + \tfrac{1}{2}|x_1 - x_2|\cot\tfrac{1}{2}\beta.$$
These are, in fact, the limits given by "Student's" distribution for $n = 2$, since the
sample standard deviation then becomes $\tfrac{1}{2}|x_1 - x_2|$ and
$$\int_{z_0}^{\infty} \frac{dz}{\pi(1+z^2)} = \frac{1}{\pi}\left(\tfrac{1}{2}\pi - \tan^{-1} z_0\right) = \alpha/2 = \beta/(2\pi),$$
so that $z_0 = \tan(\tfrac{1}{2}\pi - \tfrac{1}{2}\beta) = \cot\tfrac{1}{2}\beta$.


20.34 As a further example of the use of studentization in setting confidence 
intervals, and the results to which it may lead, we consider Example 20.7. 


Example 20.7—Confidence intervals for the ratio of means of two normal variables 


Let $x$, $y$ be jointly normal variables with means $\xi$, $\eta$ and unknown variances and
covariance. Suppose that $\xi$ is large enough for the range of $x$ to be effectively positive.
Consider the setting up of confidence intervals for the ratio $\theta = \eta/\xi$ based on the
statistic $\bar{y}/\bar{x}$. We have
$$P\left(\frac{\bar{y}}{\bar{x}} \le \theta\right) = P(\bar{y} - \theta\bar{x} \le 0). \tag{20.68}$$

Now the quantity $y - \theta x$ is itself normally distributed and in a sample of $n$ observations
the mean $\bar{y} - \theta\bar{x}$ is also normally distributed with variance
$$(\operatorname{var} y - 2\theta \operatorname{cov}(x, y) + \theta^2 \operatorname{var} x)/n.$$
Hence (cf. 16.10) the ratio
$$t = \frac{(\bar{y} - \theta\bar{x})\sqrt{n-1}}{\{\operatorname{var} y - 2\theta \operatorname{cov}(x, y) + \theta^2 \operatorname{var} x\}^{1/2}} \tag{20.69}$$
is distributed as "Student's" $t$ with $n-1$ degrees of freedom, if the denominator
terms are estimated from the sample by formulae of the type $\Sigma(x - \bar{x})^2/n$.

This result is due to Fieller (1940). We may find critical values of $t$ from the
tables in the usual way, and the question is then whether, from (20.69), we can assign
corresponding limits to $\theta$. There is now no single sufficient statistic for $\theta$. Our
equation depends on the set of five sufficient statistics consisting of two means and
three (co)variance terms, which are to some extent dependent. We may therefore
expect some of the difficulties we have previously encountered in 20.12-20.14 to appear
here; and in fact they do so.

Let us consider how $\theta$ and $t^2$ vary for assigned values of the five statistics. We have
$$\frac{t^2}{n-1} = \frac{\bar{y}^2 \operatorname{var} x - 2\bar{x}\bar{y} \operatorname{cov}(x, y) + \bar{x}^2 \operatorname{var} y}{\operatorname{var} x \operatorname{var} y - \{\operatorname{cov}(x, y)\}^2} - \frac{[\{\bar{y} \operatorname{cov}(x, y) - \bar{x} \operatorname{var} y\} - \theta\{\bar{y} \operatorname{var} x - \bar{x} \operatorname{cov}(x, y)\}]^2}{[\operatorname{var} x \operatorname{var} y - \{\operatorname{cov}(x, y)\}^2]\,\{\operatorname{var} y - 2\theta \operatorname{cov}(x, y) + \theta^2 \operatorname{var} x\}}. \tag{20.70}$$
(20.70) is a cubic in $\theta$ and $t^2$. If we graph with $\theta$ as ordinate and $t^2$ as abscissa we get a
figure similar to that of Fig. 20.9 (which, it may be as well to note, is not a confidence
diagram).

The maximum value (say $t^2_{\max}$) of $t^2$ is, from (20.70), attained when
$$\theta = \frac{\bar{y} \operatorname{cov}(x, y) - \bar{x} \operatorname{var} y}{\bar{y} \operatorname{var} x - \bar{x} \operatorname{cov}(x, y)}. \tag{20.71}$$
The minimum value is at $t^2 = 0$. The line $t^2/(n-1) = \bar{x}^2/\operatorname{var} x$ is an asymptote.

Thus for $t^2 = 0$ or $t^2_{\max}$, the two values of $\theta$ coincide. For $t^2 > t^2_{\max}$ they are
imaginary. For $0 < t^2 < t^2_{\max}$ they are real and distinct. As $t^2$ goes from 0 to $t^2_{\max}$ the
larger root $\theta$ increases monotonically (or decreases so) from the observed value $\bar{y}/\bar{x}$ to
the value given by (20.71), while the smaller root decreases (or increases) from $\bar{y}/\bar{x}$,
becomes infinite at the asymptote, reappears with changed sign on the opposite side, and
monotonically approaches the value (20.71) to rejoin the other root. The limits for $\theta$
corresponding to a given critical value of $t^2$ are indicated in Fig. 20.9 (adapted from
Fieller, 1954). For specified values of $\bar{x}$, $\bar{y}$, $\operatorname{var} x$, $\operatorname{cov}(x, y)$ and $\operatorname{var} y$, we may assert
that $\theta$ lies inside a real interval for confidence coefficients giving $t^2/(n-1)$ in the range
0 to $\bar{x}^2/\operatorname{var} x$; that it lies in an interval which is infinite in the negative direction for
$\bar{x}^2/\operatorname{var} x < t^2/(n-1) < t^2_{\max}/(n-1)$; and only that it lies somewhere in the interval
$-\infty$ to $+\infty$ for $t^2/(n-1) > t^2_{\max}/(n-1)$.
[Fig. 20.9. Confidence intervals based on (20.71) (see text)]
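The three situations described in Example 20.7 can be reproduced by solving the quadratic inequality in $\theta$ that results from requiring the square of (20.69) not to exceed a critical value of $t^2$. The sketch below is one way of doing this; the function names and the labels returned are assumptions made for the illustration, and the "outside interval" label is only one reading of the unbounded case described in the text.

```python
import math

def t_squared(theta, xbar, ybar, var_x, var_y, cov_xy, n):
    """Square of the ratio (20.69) for a trial value theta."""
    q = var_y - 2 * theta * cov_xy + theta ** 2 * var_x
    return (n - 1) * (ybar - theta * xbar) ** 2 / q

def fieller_set(xbar, ybar, var_x, var_y, cov_xy, n, t_crit):
    """Describe {theta : t^2(theta) <= t_crit^2} as (kind, lo, hi)."""
    g = t_crit ** 2 / (n - 1)
    a = xbar ** 2 - g * var_x         # coefficient of theta^2
    b = xbar * ybar - g * cov_xy      # minus half the coefficient of theta
    c = ybar ** 2 - g * var_y         # constant term
    if a == 0:
        raise ValueError("degenerate case: t^2/(n-1) equals xbar^2/var x")
    disc = b * b - a * c
    if disc < 0:
        return ("whole line", None, None)      # no real roots
    root = math.sqrt(disc)
    lo, hi = sorted(((b - root) / a, (b + root) / a))
    if a > 0:
        return ("interval", lo, hi)            # lo <= theta <= hi
    return ("outside interval", lo, hi)        # theta <= lo or theta >= hi

# Made-up statistics; 2.262 is the 5 per cent point of |t| with 9 d.f.
print(fieller_set(10.0, 5.0, 1.0, 1.0, 0.5, 10, 2.262))
```

The sign of the leading coefficient separates the bounded case, $t^2/(n-1) < \bar{x}^2/\operatorname{var} x$, from the unbounded ones, and the sign of the discriminant separates real from imaginary roots.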


Simultaneous confidence intervals for several parameters 
20.35 Cases fairly frequently arise in which we wish to estimate more than one
parameter of a population, for example the mean and variance. The extension of the
theory of confidence intervals for one parameter to this case of two or more parameters
is a matter of very considerable difficulty. What we should like to be able to do, given,
say, two parameters $\theta_1$ and $\theta_2$ and two statistics $t$ and $u$, is to make simultaneous interval
assertions of the type
$$P\{t_0 \le \theta_1 \le t_1 \text{ and } u_0 \le \theta_2 \le u_1\} = 1-\alpha. \tag{20.72}$$
This, however, is rarely possible. Sometimes we can make a statement giving a
confidence region for the two parameters together, e.g. such as
$$P\{w_0 \le \theta_1 + \theta_2 \le w_1\} = 1-\alpha. \tag{20.73}$$
But this is not entirely satisfactory; we do not know, so to speak, how much of the
uncertainty of the region to assign to each parameter. It may be that, unless we are
prepared to lay down some new rule on this point, the problem of locating the
parameters in separate intervals is insoluble.
Even for large samples the problems are severe. We may then find that we can
determine intervals of the type
$$P\{t_0(\theta_2) \le \theta_1 \le t_1(\theta_2)\} = 1-\alpha$$
and substitute a (large sample) estimate of $\theta_2$ in the limits $t_0(\theta_2)$ and $t_1(\theta_2)$. This is
very like the familiar procedure in the theory of standard errors, where we replace
parameters occurring in the error variances by estimates obtained from the samples.
20.36 We shall not attempt to develop the theory of simultaneous confidence
intervals any further here. The reader who is interested may consult papers by S. N.
Roy and Bose (1953) and S. N. Roy (1954) on the theoretical aspect. Bartlett (1953,
1955) discussed the generalization of the method of 20.15 to the case of two or more
unknown parameters. Halperin and Mantel (1963) and Halperin (1964) consider
intervals for non-linear functions of parameters, especially in large samples.

The theorem of 20.17 concerning shortest intervals was generalized by Wilks
and Daly (1939). Under fairly general conditions the large-sample regions for $l$
parameters which are smallest on the average are given by
$$\chi^2 = \sum_{i=1}^{l} \sum_{j=1}^{l} I^{-1}_{ij}\, \frac{\partial \log L}{\partial \theta_i}\, \frac{\partial \log L}{\partial \theta_j} \le \chi^2_\alpha \tag{20.74}$$
where $I^{-1}$ is the inverse matrix to the information matrix whose general element is
$$I_{ij} = E\left(\frac{\partial \log L}{\partial \theta_i}\, \frac{\partial \log L}{\partial \theta_j}\right)$$
and $\chi^2_\alpha$ is such that $P(\chi^2 \le \chi^2_\alpha) = 1-\alpha$, the probability being calculated from the $\chi^2$
distribution with $l$ degrees of freedom. This is clearly related to the result of 17.39
giving the minimum attainable variances (and, by a simple extension, covariances) of
a set of unbiassed estimators of several parametric functions.

In Volume 3, when we discuss the Analysis of Variance, we shall meet the problem
of simultaneously setting confidence intervals for a number of means.
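To make (20.74) concrete, the sketch below works it out for the mean and variance of a univariate normal distribution, where the information matrix is diagonal with elements $n/\sigma^2$ and $n/(2\sigma^4)$, and checks the coverage of the resulting region by simulation. The sample size, the number of replications, the seed and the function names are arbitrary choices for the illustration; with two degrees of freedom the $\chi^2$ point is available in closed form as $-2\log\alpha$.

```python
import math
import random

def in_region(sample, mu, sigma2, alpha=0.05):
    """Is (mu, sigma2) inside the large-sample region (20.74) for a normal sample?"""
    n = len(sample)
    xbar = sum(sample) / n
    m2 = sum((v - mu) ** 2 for v in sample) / n     # second moment about mu
    # Scores: n(xbar - mu)/sigma2 and n(m2 - sigma2)/(2 sigma2^2).
    # Inverse information: diag(sigma2/n, 2 sigma2^2/n).
    stat = n * (xbar - mu) ** 2 / sigma2 + n * (m2 - sigma2) ** 2 / (2 * sigma2 ** 2)
    chi2_alpha = -2.0 * math.log(alpha)             # exact for 2 degrees of freedom
    return stat <= chi2_alpha

def coverage(n=200, reps=2000, mu=1.0, sigma2=4.0, seed=0):
    """Proportion of simulated samples whose region covers the true parameters."""
    rng = random.Random(seed)
    sd = math.sqrt(sigma2)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sd) for _ in range(n)]
        hits += in_region(sample, mu, sigma2)
    return hits / reps

print(coverage())
```

The simulated proportion should lie close to $1-\alpha$ for large $n$; it is an approximation only, since (20.74) is a large-sample result.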


Tolerance intervals 

20.37 Throughout this chapter we have been discussing the setting of confidence
intervals for the parameters entering explicitly into the specification of a distribution.
But the technique of confidence intervals can be used for other problems. We shall
see in later chapters that intervals can be found for the quantiles of a parent distribution
(cf. Exercise 20.17) and also for the entire distribution function itself, without any
assumption on the form of the distribution beyond its continuity. There is another
type of problem, commonly met with in practical sampling, which may be solved by
these methods. Suppose that, on the basis of a sample of $n$ independent observations
from a distribution, we wish to find two limits, $L_1$ and $L_2$, between which at least a
given proportion $\gamma$ of the distribution may be asserted to lie. Clearly, we can only
make such an assertion in probabilistic form, i.e. we assert that, with given probability $\beta$,
at least a proportion $\gamma$ of the distribution lies between $L_1$ and $L_2$. $L_1$ and $L_2$ are
called tolerance limits for the distribution; we shall call them the $(\beta, \gamma)$ tolerance limits.
The interval $(L_1, L_2)$ is called a tolerance interval. In Chapter 32 we shall see that
tolerance limits, also, may be set without assumptions (except continuity) on the
form of the parent distribution. In this chapter, however, we shall discuss the
derivation of tolerance limits for a normal distribution, due to Wald and Wolfowitz
(1946).


20.38 Since the sample mean and variance are a pair of sufficient statistics for the
parameters of a normal distribution (Example 17.17), it is natural to base tolerance
limits for the distribution upon them. In a sample of size $n$, we work with the
unbiassed statistics
$$\bar{x} = \Sigma x/n, \qquad s'^2 = \Sigma(x - \bar{x})^2/(n-1),$$
and define
$$A(\bar{x}, s', \lambda) = \int_{\bar{x} - \lambda s'}^{\bar{x} + \lambda s'} f(t)\,dt, \tag{20.75}$$
where $f(t)$ is the normal frequency function. We now seek to determine the value $\lambda$
so that
$$P\{A(\bar{x}, s', \lambda) > \gamma\} = \beta. \tag{20.76}$$
$L_1 = \bar{x} - \lambda s'$ and $L_2 = \bar{x} + \lambda s'$ will then be a pair of central $(\beta, \gamma)$ tolerance limits for
the parent distribution. Since we are concerned only with the proportion of that
distribution covered by the interval $(L_1, L_2)$, we may without any loss of generality
standardize the population mean at 0 and its variance at 1. Thus
$$f(t) = (2\pi)^{-1/2} \exp(-\tfrac{1}{2}t^2). \tag{20.77}$$


20.39 Consider first the conditional probability, given $\bar{x}$, that $A(\bar{x}, s', \lambda)$ exceeds $\gamma$.
We denote this by $P\{A > \gamma \mid \bar{x}\}$. Now $A$ is a monotone increasing function of $s'$,
and the equation in $s'$
$$A(\bar{x}, s', \lambda) = \gamma \tag{20.78}$$
has just one root, which we denote by $s'(\bar{x}, \gamma, \lambda)$. Let
$$\lambda s'(\bar{x}, \gamma, \lambda) = r(\bar{x}, \gamma). \tag{20.79}$$
Given $\bar{x}$ and $\gamma$, $r = r(\bar{x}, \gamma)$ is immediately obtainable from a table of the normal integral,
since
$$\int_{\bar{x} - r}^{\bar{x} + r} f(t)\,dt = \gamma. \tag{20.80}$$
From (20.80) it is clear that $r$ does not depend upon $\lambda$. Moreover, since $A$ is monotone
increasing in $s'$, the inequality $A > \gamma$ is equivalent to
$$s' > s'(\bar{x}, \gamma, \lambda) = r(\bar{x}, \gamma)/\lambda.$$
Thus we may write
$$P\{A > \gamma \mid \bar{x}\} = P\{s' > r/\lambda \mid \bar{x}\}, \tag{20.81}$$
and since $\bar{x}$ and $s'$ are independently distributed, (20.81) becomes
$$P\{A > \gamma \mid \bar{x}\} = P\{(n-1)s'^2 > (n-1)r^2/\lambda^2\}. \tag{20.82}$$
Since $(n-1)s'^2 = \Sigma(x - \bar{x})^2$ is distributed like $\chi^2$ with $(n-1)$ degrees of freedom, we
have finally
$$P\{A > \gamma \mid \bar{x}\} = P\{\chi^2_{n-1} > (n-1)r^2/\lambda^2\}, \tag{20.83}$$
so that by using a table of the $\chi^2$ integral, we can determine (20.83).


20.40 To obtain the unconditional probability $P(A > \gamma)$ from (20.83), we must
integrate it over the distribution of $\bar{x}$, which is normal with zero mean and variance
$1/n$. This is a tedious numerical operation, but fortunately an excellent approximation
is available. We expand $P(A > \gamma \mid \bar{x})$ in a Taylor series about $\bar{x} = \mu = 0$, and since
it is an even function of $\bar{x}$, the odd powers in the expansion vanish,(*) leaving
$$P(A > \gamma \mid \bar{x}) = P(A > \gamma \mid 0) + \frac{\bar{x}^2}{2!} P''(A > \gamma \mid 0) + \cdots \tag{20.84}$$
Taking expectations we have
$$P(A > \gamma) = P(A > \gamma \mid 0) + \frac{1}{2n} P''(A > \gamma \mid 0) + O(n^{-2}). \tag{20.85}$$
But from (20.84) with $\bar{x} = 1/\sqrt{n}$
$$P\left(A > \gamma \,\Big|\, \frac{1}{\sqrt{n}}\right) = P(A > \gamma \mid 0) + \frac{1}{2n} P''(A > \gamma \mid 0) + O(n^{-2}). \tag{20.86}$$
(20.85) and (20.86) give
$$P(A > \gamma) = P\left(A > \gamma \,\Big|\, \frac{1}{\sqrt{n}}\right) + O(n^{-2}), \tag{20.87}$$
and we may use (20.87) to find an approximate value for $\lambda$ in (20.83). Wald and
Wolfowitz (1946) showed (see also Ellison (1964)) that the approximation is extremely
good even for values of $n$ as low as 2 if $\beta$ and $\gamma$ are $\ge 0.95$, as they usually are in practice.
Bowker (1947) gives tables of $\lambda$ (his $k$) for $\beta$ (his $\gamma$) = 0.75, 0.90, 0.95, 0.99 and $\gamma$ (his
$P$) = 0.75, 0.90, 0.99 and 0.999, for sample sizes $n$ = 2 (1) 102 (2) 180 (5) 300 (10)
400 (25) 750 (50) 1000.
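The approximation (20.87) reduces the calculation of $\lambda$ to two one-dimensional problems: solve (20.80) for $r$ at $\bar{x} = 1/\sqrt{n}$, then choose $\lambda$ so that (20.83) equals $\beta$. The sketch below carries this out with bisection. The series used for the $\chi^2$ integral, the bracketing interval (which assumes $\beta > 0.5$) and the function names are assumptions of the illustration, not part of the Wald-Wolfowitz derivation.

```python
import math

def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bisect(f, lo, hi, iters=200):
    """Root of an increasing function f on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def chi2_cdf(x, df):
    """P(chi-square with df d.f. <= x), from the incomplete-gamma series."""
    if x <= 0.0:
        return 0.0
    a, h = 0.5 * df, 0.5 * x
    term = math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
    total, k = 0.0, 0
    while term > 1e-16 * max(total, 1e-300):
        total += term
        k += 1
        term *= h / (a + k)
    return total

def tolerance_factor(n, beta=0.95, gamma=0.95):
    """Approximate lambda of (20.76), using (20.87) with xbar = 1/sqrt(n)."""
    xbar = 1.0 / math.sqrt(n)
    # r solves (20.80): the interval xbar -/+ r has normal content gamma.
    r = bisect(lambda v: norm_cdf(xbar + v) - norm_cdf(xbar - v) - gamma, 0.0, 20.0)
    # (20.83) = beta means (n-1) r^2 / lambda^2 is the lower (1 - beta) point
    # of chi-square with n-1 d.f.; for beta > 0.5 this point lies below n-1.
    q = bisect(lambda v: chi2_cdf(v, n - 1) - (1.0 - beta), 0.0, float(n - 1))
    return r * math.sqrt((n - 1) / q)

print(tolerance_factor(10), tolerance_factor(50))
```

As $n$ increases the factor falls towards the normal deviate that would be used if the mean and variance were known, which gives a quick sanity check on the output.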

On examination of the argument above, it will be seen to hold if $\bar{x}$ is replaced by any
estimator $\hat{\mu}$ of the mean, and $s'^2$ by any independent estimator $\hat{\sigma}^2$ of the variance, of a
normal population, as pointed out by Wallis (1951). If the mean is estimated from $m$
observations and the variance estimate has $\nu$ degrees of freedom, Taguti (1958) gives
tables of $\lambda$ (his $k$) for $\beta$ (his $1-\alpha$) and $\gamma$ (his $P$) = 0.90, 0.95 and 0.99; and $m$ = 0.5 (0.5)
2 (1) 10 (2) 20 (5) 30 (10) 60 (20) 100, 200, 500, 1000, $\infty$; $\nu$ = 1 (1) 20 (2) 30 (5) 100 (100)
1000, $\infty$. The small fractional values of $m$ are useful in some applications discussed
by Taguti. Weissberg and Beatty (1960) give tables of $\lambda/r$ (their $u$) for $\nu$ (their $f$) = 1
(1) 150 (2) 250 (5) 500 (10) 1000 (1000) 10,000, $\infty$ and $\beta$ (their $\gamma$) = 0.90, 0.95, 0.99;
and of $r$ for $n$ = 1 (1) 100 (5) 200 (10) 300 (20) 500 (100) 1000 (1000) 10,000, $\infty$
and $\gamma$ (their $P$) = 0.5, 0.75, 0.9, 0.95, 0.99, 0.999.

Fraser and Guttman (1956) and Guttman (1957) consider tolerance intervals which
cover a given proportion of a normal parent distribution on the average.


EXERCISES 
20.1 For a sample of $n$ from the distribution
$$dF = \frac{x^{p-1} e^{-x/\theta}}{\Gamma(p)\,\theta^{p}}\,dx, \qquad 0 \le x < \infty,\ p > 0,$$
we have seen (Exercise 17.1) that, for known $p$, a sufficient statistic for $\theta$ is $\bar{x}/p$. Hence
derive confidence intervals for $\theta$.


(*) This is because the interval is symmetric about $\bar{x}$; it could not happen otherwise.
20.2 Show that for the rectangular population
$$dF = dx/\theta, \qquad 0 \le x \le \theta,$$
and confidence coefficient $1-\alpha$, confidence limits for $\theta$ are $t$ and $t/\psi$, where $t$ is the sample
range and $\psi$ is given by
$$\psi^{n-1}\{n - (n-1)\psi\} = \alpha.$$
(Wilks, 1938c)


20.3 Show that, for the distribution of the previous exercise, confidence limits for
$\theta$ from samples of two, $x_1$ and $x_2$, are
$$(x_1 + x_2)/[1 \pm \{1 - (1-\alpha)^{1/2}\}].$$
(Neyman, 1937b)


20.4 In Exercise 20.2, show also that if $L$ is the larger of a sample of size two,
confidence limits for $\theta$ are
| re ee 
and that if M is the largest of samples of size four, limits are 
M, M/c. 
(Neyman, 1937b) 


20.5 Using the asymptotic multivariate normal distribution of Maximum Likelihood
estimators (18.26) and the $\chi^2$ distribution of the exponent of a multivariate normal
distribution (15.10), show that (20.74) gives a large-sample confidence region for a set of
parameters. From it, derive a confidence region for the mean and variance of a univariate
normal distribution.


20.6 In setting confidence limits to the variance of a normal population by the use
of the distribution of the sample variance (Example 20.6), sketch the confidence belts
for some value of the confidence coefficient, and show graphically that they always provide
a connected range within which $\sigma^2$ is located.


20.7 Show how to set confidence limits to the ratio of variances $\sigma_1^2/\sigma_2^2$ in two normal
populations, based on independent samples of $n_1$ observations from the first and $n_2$
observations from the second. (Use the distribution of the ratio of sample variances
at (16.24).)


20.8 Use the method of 20.10 to show that large-sample 95 per cent confidence
limits for $\varpi$ in the binomial distribution of Example 20.2 are given by
$$\frac{1}{1 + (1.96)^2/n}\left[p + \frac{(1.96)^2}{2n} \pm 1.96\sqrt{\left\{\frac{p(1-p)}{n} + \frac{(1.96)^2}{4n^2}\right\}}\right].$$


20.9 Using Geary's theorem (Exercise 11.11), show that large-sample 95 per cent
confidence limits for the ratio $\varpi_2/\varpi_1$ of the parameters of two binomial distributions,
based on independent samples of size $n_2$, $n_1$ respectively, are given by


P2/P1 (1-96)? 1—p, 1—p.  (1:96)?/ 1  4(1—d:) 
asat EM a dt of 41:96 /|——#!4——224, 5 (44 = * 7 Ih. 
1+(1:96)?/n, { 2n2P2 JI MP1 Meopo 4 (aa NyNopy )] \ 


(Noether, 1957) 


20.10 In Example 20.6, show that the confidence interval based on
$$P\left\{\frac{n s^2}{\chi_0^2} \le \sigma^2 \le \frac{n s^2}{\chi_1^2}\right\} = 1-\alpha$$
(where $\chi_0^2$ and $\chi_1^2$ are the upper and lower $\tfrac{1}{2}\alpha$ points of the $\chi^2$ distribution with $(n-1)$ d.f.)
is not the physically shortest interval for $\sigma^2$ based on the $\chi^2$ distribution of $ns^2/\sigma^2$.
(cf. Tate and Klett, 1959)


20.11 From two normal populations with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2 = \sigma_2^2 = \sigma^2$,
independent samples of sizes $n_1$ and $n_2$ respectively are drawn. Show that
$$t = \{(\bar{x}_1 - \mu_1) - (\bar{x}_2 - \mu_2)\}\left\{\frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\right\}^{-1/2}$$
(where $\bar{x}_1$, $\bar{x}_2$ and $s_1^2$, $s_2^2$ are the sample means and variances) is distributed in "Student's"
distribution with $(n_1 + n_2 - 2)$ d.f., and hence set confidence limits for $(\mu_1 - \mu_2)$.


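20.12 In Exercise 20.11, if $\sigma_1^2 \ne \sigma_2^2$, show that the ratio distributed in "Student's"
distribution is no longer $t$, but
$$t' = \frac{(\bar{x}_1 - \mu_1) - (\bar{x}_2 - \mu_2)}{(\sigma_1^2/n_1 + \sigma_2^2/n_2)^{1/2}}\left\{\left(\frac{n_1 s_1^2}{\sigma_1^2} + \frac{n_2 s_2^2}{\sigma_2^2}\right)\Big/(n_1 + n_2 - 2)\right\}^{-1/2}.$$
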
20.13 If $f(x \mid \theta) = g(x)/h(\theta)$, $(a(\theta) \le x \le b(\theta))$, and $b(\theta)$ is a monotone decreasing
function of $a(\theta)$, show (cf. 17.40-1) that the extreme observations $x_{(1)}$ and $x_{(n)}$ are a pair
of jointly sufficient statistics for $\theta$. From their joint distribution, show that the single
sufficient statistic for $\theta$,
$$\hat{\theta} = \min\{a^{-1}(x_{(1)}),\ b^{-1}(x_{(n)})\},$$
has distribution
$$dF = \frac{n\{h(\hat{\theta})\}^{n-1}}{\{h(\theta)\}^{n}}\,\{-h'(\hat{\theta})\}\,d\hat{\theta}, \qquad \theta \le \hat{\theta} \le \theta^*,$$
where $\theta^*$ is defined by $a(\theta^*) = b(\theta^*)$.


20.14 In Exercise 20.13, show that $y = h(\hat{\theta})/h(\theta)$ has distribution
$$dF = n y^{n-1}\,dy, \qquad 0 \le y \le 1.$$
Show that
$$P\{\alpha^{1/n} \le y \le 1\} = 1-\alpha,$$
and hence set a confidence interval for $\theta$. Show that this is shorter than any other interval
based on the distribution of $y$.
(Huzurbazar, 1955)


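20.15 Apply the result of Exercise 20.14 to show that a confidence interval for $\theta$ in
$$dF = dx/\theta, \qquad 0 \le x \le \theta,$$
is obtainable from
$$P\{x_{(n)} \le \theta \le x_{(n)}\,\alpha^{-1/n}\} = 1-\alpha$$
and that this is shorter than the interval in Exercise 20.2.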


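20.16 Use the result of Exercise 20.14 to show that a confidence interval for $\theta$ in
$$dF = e^{-(x-\theta)}\,dx, \qquad \theta \le x < \infty,$$
is obtainable from
$$P\left\{x_{(1)} + \frac{1}{n}\log\alpha \le \theta \le x_{(1)}\right\} = 1-\alpha.$$
(Huzurbazar, 1955)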
20.17 If $I(x)$ is a confidence interval for $\theta$ calculated from the distribution of a sample,
$f(x \mid \theta)$, and $\theta_0$ is the true value of $\theta$, show that the expected length of $I(x)$ may be written as
$$E(L) = \int\left\{\int_{\theta \in I(x)} d\theta\right\} dF(x \mid \theta_0)$$
and that
$$E(L) = \int_{\theta \ne \theta_0} \operatorname{prob}\{\theta \in I(x) \mid \theta_0\}\,d\theta,$$
the integral over all false values of $\theta$ of the probability of inclusion in the confidence
interval. (Pratt, 1961, 1963)


20.18 If $x \in A(\theta)$ if and only if $\theta \in I(x)$ in Exercise 20.17, show that $E(L)$ is minimized
by choosing $A(\theta)$ for each $\theta$ so that $\operatorname{prob}\{x \in A(\theta) \mid \theta_0\}$ is minimized. (This is equivalent
to choosing the most powerful test for each $\theta$ against the alternative value $\theta_0$; cf. 23.26
below.)
(Pratt, 1961, 1963)


20.19 $x$ and $y$ have a bivariate normal distribution with variances $\sigma_1^2$, $\sigma_2^2$, and
correlation parameter $\rho$. Show that the variables
$$u = \frac{x}{\sigma_1} + \frac{y}{\sigma_2}, \qquad v = \frac{x}{\sigma_1} - \frac{y}{\sigma_2},$$
are independently normally distributed. In a sample of $n$ observations with sample
variances $s_1^2$ and $s_2^2$ and correlation coefficient $r_{xy}$, show that the sample correlation
coefficient of $u$ and $v$ may be written
$$r_{uv}^2 = \frac{(l - \lambda)^2}{(l + \lambda)^2 - 4 r_{xy}^2\, l\lambda},$$
where $l = s_1^2/s_2^2$ and $\lambda = \sigma_1^2/\sigma_2^2$. Hence show that, whatever the value of $\rho$, confidence
limits for $\lambda$ are given by
$$l\{K - (K^2 - 1)^{1/2}\}, \qquad l\{K + (K^2 - 1)^{1/2}\},$$
where
$$K = 1 + \frac{2(1 - r_{xy}^2)}{n-2}\,t_\alpha^2$$
and $t_\alpha^2$ is the $100\alpha$ per cent point of "Student's" $t^2$ distribution.
(Pitman, 1939a)


20.20 In 20.39, show that $r(\bar{x}, \gamma)$ defined at (20.80) is, asymptotically in $n$,
$$r(\bar{x}, \gamma) \sim r(0, \gamma)\left(1 + \frac{1}{2n}\right).$$
(Bowker, 1946)


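20.21 Using the method of Example 6.4, show that for a $\chi^2$ distribution with $\nu$
degrees of freedom, the value above which $100\beta$ per cent of the distribution lies is $\chi_\beta^2$
where
$$\frac{\chi_\beta^2}{\nu} \sim 1 + \left(\frac{2}{\nu}\right)^{1/2} d_{1-\beta} + \frac{2}{3\nu}\left(d_{1-\beta}^2 - 1\right) + o\left(\frac{1}{\nu}\right),$$
where
$$\int_{-\infty}^{d_\alpha} (2\pi)^{-1/2} \exp(-\tfrac{1}{2}t^2)\,dt = \alpha.$$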


20.22 Combine the results of Exercises 20.20-20.21 to show that, from (20.83),
$$\lambda \sim r(0, \gamma)\left\{1 + \frac{d_\beta}{\sqrt{2n}} + \frac{5(d_\beta^2 + 2)}{12n}\right\}.$$
(Bowker, 1946)


CHAPTER 21 
INTERVAL ESTIMATION: FIDUCIAL INTERVALS 


21.1 At the outset of this chapter it is desirable to make a few remarks on matters 
of terminology. Problems of interval estimation in the precise sense began to engage 
the attention of statisticians round about the period 1925-1930. The approach from 
confidence intervals, as we have defined them in the previous chapter, and that from fiducial 
intervals, which we shall try to expound in this chapter, were presented respectively 
by J. Neyman and by R. A. Fisher ; and since they seemed to give identical results there 
was at first a very natural belief that the two methods were only saying the same things 
in different terms. In consequence, the earlier literature of the subject often contains
references to “fiducial” intervals in the sense of our “confidence” intervals; and
(less frequently) to “confidence” intervals in some sense more nearly related to the
“fiducial” line of argument.

Although this confusion of nomenclature has never been adequately cleared up, 
it is now generally recognized that fiducial intervals are different in kind from confidence 
intervals. But their devotees have, so it seems to us, not always made it quite clear 
where the difference lies; nor have they always used the term “fiducial” in strict
conformity with the usage of Fisher, who, having invented it, may be allowed the right 
of precedence by way of definition. We shall present what we believe to be the basic 
ideas of the fiducial approach, but the reader who goes to the original literature may 
expect to find considerable variation in terminology. 


21.2 To fix the ideas, consider a sample of size $n$ from a normal population of
unknown mean, $\mu$, and unit variance. The sample mean $\bar{x}$ is a sufficient statistic for $\mu$,
and its distribution is

$$dF = \sqrt{\frac{n}{2\pi}}\, \exp\{-\tfrac{1}{2}n(\bar{x}-\mu)^2\}\, d\bar{x}. \tag{21.1}$$

(21.1), of course, expresses the distribution of different values of $\bar{x}$ for a fixed
unknown value of $\mu$. Now suppose that we have a single sample of $n$ observations,
yielding a sample mean $\bar{x}_1$. We recall from (17.68) that the Likelihood Function of
the sample, $L(\bar{x}_1 \mid \mu)$, will (since $\bar{x}$ is sufficient for $\mu$) depend on $\mu$ only through the
distribution of $\bar{x}$ at (21.1), which may therefore be taken to represent the Likelihood
Function. Thus

$$L(\bar{x}_1 \mid \mu) \propto \sqrt{\frac{n}{2\pi}}\, \exp\{-\tfrac{1}{2}n(\bar{x}_1-\mu)^2\}. \tag{21.2}$$

If we are prepared, perhaps somewhat intuitively, to use the Likelihood Function (21.2)
as measuring the intensity of our credence in a particular value of $\mu$, we finally write

$$dF = \sqrt{\frac{n}{2\pi}}\, \exp\{-\tfrac{1}{2}n(\bar{x}_1-\mu)^2\}\, d\mu, \tag{21.3}$$

which we shall call the fiducial distribution of the parameter $\mu$. We note that the integral
of (21.3) over the range $(-\infty, \infty)$ for $\mu$ is 1, so that no constant adjustment is necessary.


21.3 This fiducial distribution is not a frequency distribution in the sense in 
which we have used the expression hitherto. It is a new concept, expressing the inten- 
sity of our belief in the various possible values of a parameter. It so happens, in this 
case, that the non-differential element in (21.3) is the same as that in (21.1). This 
is not essential, though it is not infrequent. 

Nor is the fiducial distribution a probability distribution in the sense of the fre- 
quency theory of probability. It may be regarded as a distribution of probability in 
the sense of degrees of belief ; the consequent link with interval estimation based on 
the use of Bayes’ theorem will be discussed below. Or it may be regarded as a new 
concept, giving formal expression to our somewhat intuitive ideas about the extent to 
which we place credence in various values of $\mu$.


21.4 The fiducial distribution can now be used to determine intervals within
which $\mu$ is located. We select some arbitrary numbers, say 0.02275 and 0.97725, and
decide to regard those values as critical in the sense that any acceptable value of $\mu$
must not give to the observed $\bar{x}_1$ a (cumulative) probability less than 0.02275 or greater
than 0.97725. Then, since these values correspond to deviations of $\pm 2\sigma$ from the
mean of a normal distribution, and $\sigma = 1/\sqrt{n}$, we have

$$-2 \le (\bar{x}_1 - \mu)\sqrt{n} \le 2,$$

which is equivalent to

$$\bar{x}_1 - 2/\sqrt{n} \le \mu \le \bar{x}_1 + 2/\sqrt{n}. \tag{21.4}$$

This, as it happens, is the same inequality as that to which we were led by central
confidence intervals based on (21.1) in Example 20.1. But it is essential to note that
it is not reached by the same line of thought. The confidence approach says that if
we assert (21.4) we shall be right in about 95.45 per cent of the cases in the long run.
Under the fiducial approach the assertion of (21.4) is equivalent to saying that (in some
sense not defined) we are 95.45 per cent sure of being right in this particular case. The
shift of emphasis is evidently the one we encountered in considering the Likelihood
Function itself, where the function $L(x \mid \theta)$ can be considered as an elementary
probability in which $\theta$ is fixed and $x$ varies, or as a likelihood in which $x$ is fixed and $\theta$ varies.
So here, we can make an inference about the range of $\theta$ either by regarding it as a
constant and setting up containing intervals which are random variables, or by regarding
the observations as fixed and setting up intervals based on some undefined intensity
of belief in the values of the parameter generating those observations.
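
The arithmetic of (21.4) may be illustrated by a short computation. The following is only a sketch under the assumptions of 21.2 (normal population, unit variance); the function names and the numerical values of the sample mean and sample size are ours, chosen purely for illustration.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mean_interval(xbar, n, k=2.0):
    """Interval (21.4): xbar -/+ k/sqrt(n), for a population of unit variance."""
    half = k / math.sqrt(n)
    return xbar - half, xbar + half


def content(k=2.0):
    """Probability lying within k standard deviations of the mean."""
    return normal_cdf(k) - normal_cdf(-k)


# Hypothetical data summary: sample mean 10.3 from n = 25 observations.
lo, hi = mean_interval(xbar=10.3, n=25)
print(lo, hi)          # limits of (21.4)
print(content(2.0))    # about 0.9545, the figure quoted in the text
```

The same limits serve for either interpretation; only the meaning attached to the figure of 95.45 per cent differs.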


21.5 There is one further fundamental distinction between the two methods. 
We have seen in the previous chapter that in confidence theory it is possible to have 
different sets of intervals for the same parameter based on different statistics (although 
we naturally discriminate between the different sets, and choose the shortest or most
selective set). This is explicitly ruled out in fiducial theory (even in the sense that
we may choose central or non-central intervals for the same distribution when using
both its tails). We must, in fact, use all the information about the parameter which
the Likelihood Function contains. This implies that if we are to set limits to $\theta$ by a
single statistic $t$, the latter must be sufficient for $\theta$. (We also reached this conclusion
from the standpoint of most selective confidence intervals in 20.30.)

As we pointed out in 17.38, there is always a set of jointly sufficient statistics for
an unknown parameter, namely the $n$ observations themselves. But this tautology
offers little consolation: even a sufficient set of two statistics would be difficult enough
to handle; a larger set is almost certainly practically useless. As to what should be
done to construct an interval for a single parameter $\theta$ where a single sufficient statistic
does not exist, writers on fiducial theory are for the most part silent.


21.6 Let $f(t, \theta)$ be a continuous frequency function and $F(t, \theta)$ the distribution
function of a statistic $t$ which is sufficient for $\theta$. Consider the behaviour of $f$ for some
fixed $t$, as $\theta$ varies. Suppose also that we know beforehand that $\theta$ must lie in a certain
range, which may in particular be $(-\infty, \infty)$. Take some critical probability $1-\alpha$
(analogous to a confidence coefficient) and let $\theta_\alpha$ be the value of $\theta$ for which $F(t, \theta) = 1-\alpha$.

Now suppose also that over the permissible range of $\theta$, $f(t_1, \theta)$ is a monotonic
non-increasing function of $\theta$ for any $t_1$. Then for all $\theta \le \theta_\alpha$ the observed $t_1$ has at least as
high a probability density as $f(t_1, \theta_\alpha)$, and for $\theta > \theta_\alpha$ it has a lower probability density.
We then choose $\theta \le \theta_\alpha$ as our fiducial interval. It includes all those values of the
parameter which give to the probability density a value greater than or equal to $f(t_1, \theta_\alpha)$.


21.7 If we require a fiducial interval of type

$$\theta_{\alpha_1} \le \theta \le \theta_{\alpha_2}$$

we look for two values of $\theta$ such that $f(t_1, \theta_{\alpha_1}) = f(t_1, \theta_{\alpha_2})$ and $F(t_1, \theta_{\alpha_1}) - F(t_1, \theta_{\alpha_2}) = 1-\alpha$.
If, between these values, $f(t_1, \theta)$ is greater than the extreme values $f(t_1, \theta_{\alpha_1})$ or $f(t_1, \theta_{\alpha_2})$,
and is less than those values outside it, the interval again comprises values for which
the probability density is at least as great as the density at the critical points.

If the distribution of $t$ is symmetrical this involves taking a range which cuts off
equal tail areas on it. For a non-symmetrical distribution the tails are to be such that
their total probability content is $\alpha$; but the contents of the two tails are not equal.
It is the extreme ordinates of the interval which must be equal. Similar considerations
have already been discussed in connexion with central confidence intervals in 20.7.


21.8 On this understanding, if our fiducial interval is increased by an element $d\theta$
at each end, the probability ordinate at the end decreases by $(\partial F(t_1, \theta)/\partial\theta)\, d\theta$. For
the fiducial distribution we then have

$$dF = -\frac{\partial F(t_1, \theta)}{\partial \theta}\, d\theta. \tag{21.5}$$

This formula, however, requires that $f(t_1, \theta)$ shall be a non-decreasing function of $\theta$
at the lower end and a non-increasing function of $\theta$ at the upper end of the interval.
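
As a numerical check on (21.5), we may take the normal case of 21.2, differentiate the distribution function of the sample mean with respect to the parameter by finite differences, and compare the result with the non-differential element of (21.3). The sketch below is ours; the function names, the step length and the trial values are illustrative assumptions only.

```python
import math


def F(xbar1, mu, n):
    """Distribution function of the sample mean at xbar1 (mean mu, unit variance)."""
    z = math.sqrt(n) * (xbar1 - mu)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fiducial_density_numeric(xbar1, mu, n, h=1e-5):
    """-dF/dmu of (21.5), approximated by a central difference of step h."""
    return -(F(xbar1, mu + h, n) - F(xbar1, mu - h, n)) / (2.0 * h)


def fiducial_density_exact(xbar1, mu, n):
    """Non-differential element of (21.3)."""
    return math.sqrt(n / (2.0 * math.pi)) * math.exp(-0.5 * n * (xbar1 - mu) ** 2)


# Trial values (hypothetical): observed mean 1.2, n = 9, evaluated at mu = 1.0.
print(fiducial_density_numeric(1.2, 1.0, 9))
print(fiducial_density_exact(1.2, 1.0, 9))
```

The two printed values agree to within the error of the finite-difference approximation.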


Example 21.1 


Consider again the normal distribution of (21.1). For any fixed $\bar{x}_1$, as $\mu$ varies
from $-\infty$ through $\bar{x}_1$ to $+\infty$, the probability density varies from zero monotonically
to a maximum at $\bar{x}_1$ and then monotonically to zero. Thus for any value in the range
$\bar{x}_1 - k$ to $\bar{x}_1 + k$ the density is greater than at the points $\bar{x}_1 - k$ or $\bar{x}_1 + k$. We can
therefore set a fiducial interval

$$\bar{x}_1 - k \le \mu \le \bar{x}_1 + k,$$

for any convenient value of $k > 0$. In (21.4) we took $k$ to be $2/\sqrt{n}$.


Example 21.2

As an example of a non-symmetrical sampling distribution, consider the distribution

$$dF = \frac{x^{p-1} e^{-x/\theta}}{\theta^{p}\, \Gamma(p)}\, dx, \qquad p > 0;\ 0 \le x < \infty. \tag{21.6}$$

If $p$ is known, $t = \bar{x}/p$ is sufficient for $\theta$ (cf. Exercise 17.1) and its sampling distribution
is easily seen to be

$$dF = \left(\frac{\beta}{\theta}\right)^{\beta} \frac{t^{\beta-1} e^{-\beta t/\theta}}{\Gamma(\beta)}\, dt, \tag{21.7}$$

where $\beta = np$. Now in this case $\theta$ may vary only from 0 to $\infty$. As it does so the
ordinate of (21.7) for fixed $t$ rises monotonically from zero to a maximum and then
falls again to zero, being in fact an inversion of a Type III distribution. Thus, if we
determine $\theta_{\alpha_1}$ and $\theta_{\alpha_2}$ such that the ordinates at those two values are equal and the
integral of (21.7) between them has the assigned value $1-\alpha$, the fiducial range is
$\theta_{\alpha_1} \le \theta \le \theta_{\alpha_2}$.

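We may write (21.7) in the form

$$dF = \frac{1}{\Gamma(\beta)} \left(\frac{\beta t}{\theta}\right)^{\beta-1} e^{-\beta t/\theta}\, d\!\left(\frac{\beta t}{\theta}\right), \tag{21.8}$$

and hence

$$F(t, \theta) = \int_0^{\beta t/\theta} \frac{u^{\beta-1} e^{-u}}{\Gamma(\beta)}\, du. \tag{21.9}$$

Then

$$\frac{\partial F}{\partial \theta} = \frac{1}{\Gamma(\beta)} \left[u^{\beta-1} e^{-u}\right]_{u=\beta t/\theta} \frac{\partial}{\partial\theta}\!\left(\frac{\beta t}{\theta}\right) = -\left(\frac{\beta t}{\theta}\right)^{\beta} \frac{e^{-\beta t/\theta}}{\theta\, \Gamma(\beta)}.$$

Thus the fiducial distribution of $\theta$ is

$$dF = \left(\frac{\beta t}{\theta}\right)^{\beta} \frac{e^{-\beta t/\theta}}{\Gamma(\beta)}\, \frac{d\theta}{\theta}. \tag{21.10}$$

The integral of this from $\theta = 0$ to $\theta = \infty$ is unity.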
In comparing (21.7) with (21.10) it should be noticed that we have replaced $dt$,
not by $d\theta$, but by $t\, d\theta/\theta$; or, putting it slightly differently, we have replaced $dt/t$ by
$d\theta/\theta$. It is worth while considering why this should be so, and to restate in specific
form the argument of 21.8.

We determine our fiducial interval by reference to the probability $F(t, \theta)$. Looking
at (21.9) we see that this is an integral whose upper limit is, apart from a constant, $t/\theta$.
Thus for variation in $\theta$ we have the ordinate of the frequency function (the integrand)
multiplied by $d_\theta(t/\theta) = -t\, d\theta/\theta^2$, while for variation in $t$ the multiplying factor is
$d_t(t/\theta) = dt/\theta$. Thus, from (21.5), $-(\partial F/\partial\theta)\, d\theta$ carries the factor $t\, d\theta/\theta^2$, while
$(\partial F/\partial t)\, dt$ carries the factor $dt/\theta$.
It is by equating these expressions that we obtain $d\theta/\theta = dt/t$.
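
The statement that (21.10) integrates to unity is easily checked numerically. The following sketch is ours: it integrates the non-differential element of (21.10) by the trapezoidal rule on a logarithmic grid in the parameter, for arbitrarily chosen (hypothetical) values of $t$ and $\beta$. The function names, the grid limits and the number of steps are illustrative assumptions.

```python
import math


def fiducial_density(theta, t, beta):
    """Non-differential element of (21.10), computed on the log scale for stability."""
    u = beta * t / theta
    log_f = beta * math.log(u) - u - math.log(theta) - math.lgamma(beta)
    return math.exp(log_f)


def integrate_log_grid(f, lo, hi, steps=20000):
    """Trapezoidal rule in y = log(theta): integral of f(theta) dtheta = integral of f(e^y) e^y dy."""
    a, b = math.log(lo), math.log(hi)
    h = (b - a) / steps
    total = 0.0
    for i in range(steps + 1):
        theta = math.exp(a + i * h)
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * f(theta) * theta
    return total * h


# Hypothetical values: observed t = 2.0 and beta = np = 6.
t_obs, beta = 2.0, 6.0
total = integrate_log_grid(lambda th: fiducial_density(th, t_obs, beta), 1e-3, 1e4)
print(total)   # close to 1
```

The logarithmic grid is a convenience only: it corresponds to the replacement of $dt/t$ by $d\theta/\theta$ noted above, so that the integrand becomes a smooth function of $\log\theta$.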


21.9 When we try to extend our theory to cover the case where two or more para- 
meters are involved, we begin to meet difficulties. In point of fact, practical examples 
in this field are so rare that any general theory is apt to be left in the air for want of 
exemplification. We shall therefore concentrate the exposition on two important 
standard cases, the estimation of the mean in normal samples where the variance is 
unknown, and the estimation of the difference of two means in samples from two 
normal populations with unequal variances. 


Fiducial inference in “Student’s” distribution


21.10 It is known that in normal samples the sample mean $\bar{x}$ and the sample
variance $s^2\ (= \Sigma(x-\bar{x})^2/n)$ are jointly sufficient for the parent mean $\mu$ and variance $\sigma^2$.
Their distribution may be written as

$$dF \propto \frac{1}{\sigma} \exp\left\{-\frac{n}{2\sigma^2}(\bar{x}-\mu)^2\right\} d\bar{x}\ \left(\frac{s}{\sigma}\right)^{n-1} \exp\left\{-\frac{n s^2}{2\sigma^2}\right\} \frac{ds}{s}. \tag{21.11}$$

If we were considering fiducial limits for $\mu$ with known $\sigma$, we should use the first factor
on the right of (21.11); but if we were considering limits for $\sigma$ with known $\mu$ we should
not use the second factor, the reason being that $\sigma$ itself enters into the first factor. In
fact (cf. Example 17.10), the sufficient statistic in this case is not $s^2$ but $\Sigma(x-\mu)^2/n$,
whose distribution is obtained by merging the two factors in (21.11).

For known $\sigma$, we should, as in Example 21.1, replace $d\bar{x}$ by $d\mu$ to obtain the fiducial
distribution of $\mu$. For known $\mu$, we should use the fact that $\Sigma(x-\mu)^2 = s'^2$ is
distributed like $x$ in (21.6) with $p = \tfrac{1}{2}n$ and $\theta = 2\sigma^2$, and hence, as in Example 21.2, replace
$ds'/s'$ by $d\sigma/\sigma$. In (21.11), $s$ is distributed as $s'$, but with $p = \tfrac{1}{2}(n-1)$. The question
is, can we here replace $d\bar{x}\, ds/s$ in (21.11) by $d\mu\, d\sigma/\sigma$ to obtain the joint fiducial
distribution of $\mu$ and $\sigma$?

Fiducialists assume that this is so. The question appears to us to be very
debatable.(*) However, let us make the assumption and see where it leads us. For the
fiducial distribution we shall then have

$$dF \propto \frac{1}{\sigma} \exp\left\{-\frac{n}{2\sigma^2}(\bar{x}-\mu)^2\right\} d\mu\ \left(\frac{s}{\sigma}\right)^{n-1} \exp\left\{-\frac{n s^2}{2\sigma^2}\right\} \frac{d\sigma}{\sigma}. \tag{21.12}$$

We now integrate for $\sigma$ to obtain the fiducial distribution of $\mu$. We arrive at

$$dF \propto \frac{d\mu}{\left\{1 + \dfrac{(\mu-\bar{x})^2}{s^2}\right\}^{n/2}}. \tag{21.13}$$

This is a form of “Student’s” distribution, with $(\mu-\bar{x})\sqrt{n-1}/s$ in place of the usual
$t$, and $n-1$ degrees of freedom. Thus, given $\alpha$, we can find two values of $t$, $t_0$ and $t_1$,
such that

$$P\{-t_0 \le t \le t_1\} = 1-\alpha$$

and this is equivalent to locating $\mu$ in the range

$$\{\bar{x} - s t_0/\sqrt{n-1},\ \ \bar{x} + s t_1/\sqrt{n-1}\}. \tag{21.14}$$

This may be interpreted, as in 20.31, in the sense of confidence intervals, i.e. as implying
that if we assert $\mu$ to lie in the range (21.14) we shall be right in a proportion $1-\alpha$ of
the cases. But this is by no means essential to the fiducial argument, as we shall see
later.

(*) Although $\bar{x}$ and $s$ are statistically independent, $\mu$ and $\sigma$ are not independent in any fiducial
sense. The laws of transformation from the frequency to the fiducial distribution have not
been elucidated to any extent for the multi-parameter case. In the above case some support for
the process can be derived a posteriori from the reflexion that it leads to “Student’s”
distribution, but if fiducial theory is to be accepted on its own merits, something more is required.
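
The limits (21.14) may be computed as in the following sketch, which is ours and not part of the argument. It takes the symmetric case $t_0 = t_1$ and leaves the percentage point of “Student’s” distribution to be supplied from tables; the function name and the data are hypothetical. Note that $s^2$ has divisor $n$, as in 21.10.

```python
import math


def student_interval(xs, t_crit):
    """Limits (21.14) with t0 = t1 = t_crit; s^2 uses divisor n, as in 21.10."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / n
    half = math.sqrt(s2) * t_crit / math.sqrt(n - 1)
    return xbar - half, xbar + half


# Hypothetical sample of n = 10 observations.
sample = [4.1, 5.0, 3.8, 4.6, 5.3, 4.4, 4.9, 4.2, 4.7, 5.0]
# Tabulated two-sided 95 per cent point of Student's t with 9 d.f.
t_crit = 2.262
print(student_interval(sample, t_crit))
```

Since $s/\sqrt{n-1}$ with divisor $n$ equals the more familiar $s/\sqrt{n}$ computed with divisor $n-1$, the limits coincide with those of the usual one-sample interval.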


The problem of two means 


21.11 We now turn to the problem of finding an interval estimate for the difference 
between the means of two normal distributions, which was left undiscussed in the previous 
chapter in order to facilitate a unified exposition here. We shall first discuss several 
confidence-interval approaches to the problem, and then proceed to the fiducial-interval 
solution. Finally, we shall examine the problem from the standpoint of Bayes’ theorem. 


21.12 Suppose, then, that we have two normal distributions, the first with mean
and variance parameters $\mu_1, \sigma_1^2$ and the second with parameters $\mu_2, \sigma_2^2$. Samples of
size $n_1, n_2$ respectively are taken, and the sample means and variances observed are $\bar{x}_1, s_1^2$
and $\bar{x}_2, s_2^2$. Without loss of generality, we assume $n_1 \le n_2$.

Now if $\sigma_1^2 = \sigma_2^2 = \sigma^2$, the problem of finding an interval for $\mu_1 - \mu_2 = \delta$ is simple.
For in this case $d = \bar{x}_1 - \bar{x}_2$ is normally distributed with

$$E(d) = \delta, \qquad \operatorname{var} d = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right), \tag{21.15}$$

and $n_1 s_1^2/\sigma^2$, $n_2 s_2^2/\sigma^2$ are each distributed like $\chi^2$ with $n_1-1$, $n_2-1$ d.f. respectively.
Since the two samples are independent, $(n_1 s_1^2 + n_2 s_2^2)/\sigma^2$ will be distributed like $\chi^2$ with
$n_1+n_2-2$ d.f., and hence, writing

$$s^2 = (n_1 s_1^2 + n_2 s_2^2)/(n_1+n_2-2),$$

we have

$$E(s^2) = \sigma^2. \tag{21.16}$$

Now

$$y = \frac{d-\delta}{\left\{\sigma^2\left(\dfrac{1}{n_1}+\dfrac{1}{n_2}\right)\right\}^{1/2}} \Bigg/ \left(\frac{s^2}{\sigma^2}\right)^{1/2} \tag{21.17}$$

$$\phantom{y} = \frac{d-\delta}{\left\{s^2\left(\dfrac{1}{n_1}+\dfrac{1}{n_2}\right)\right\}^{1/2}} \tag{21.18}$$

is a ratio of a standardized normal variate to the square root of an unbiassed estimator
of its sampling variance, which is distributed independently of it (since $s_1^2$ and $s_2^2$ are
independent of $\bar{x}_1$ and $\bar{x}_2$). Moreover, $(n_1+n_2-2)s^2/\sigma^2$ is a $\chi^2$ variable with $n_1+n_2-2$
d.f. Thus $y$ is of exactly the same form as the one-sample ratio

$$\frac{\bar{x}_1 - \mu_1}{\{s_1^2/(n_1-1)\}^{1/2}},$$

which we have on several occasions (e.g. Example 11.8) seen to be distributed in
“Student’s” distribution with $n_1-1$ d.f. Hence (21.18) is also a “Student’s”
variable, but with $n_1+n_2-2$ d.f., a result which may easily be proved directly.

There is therefore no difficulty in setting confidence intervals or fiducial intervals
for $\delta$ in this case: we simply use the method of 20.31 or 21.10, and of course, as in
the one-sample case, we obtain identical results, quite incidentally.
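
For the equal-variance case the statistic (21.18) is computed as in the sketch below. This is an illustration of ours; the function name and the small data sets are hypothetical. The sums of squares about the sample means are $n_1 s_1^2$ and $n_2 s_2^2$ in the notation of this section.

```python
import math


def pooled_t(x1, x2, delta=0.0):
    """Statistic (21.18) and its degrees of freedom, assuming equal variances."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    ss1 = sum((x - m1) ** 2 for x in x1)      # n1 * s1^2
    ss2 = sum((x - m2) ** 2 for x in x2)      # n2 * s2^2
    s2 = (ss1 + ss2) / (n1 + n2 - 2)          # pooled estimator, E(s^2) = sigma^2
    y = (m1 - m2 - delta) / math.sqrt(s2 * (1.0 / n1 + 1.0 / n2))
    return y, n1 + n2 - 2


# Hypothetical samples.
y, df = pooled_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0, 8.0])
print(y, df)
```

The value returned is referred to “Student’s” distribution with the stated degrees of freedom; limits for $\delta$ follow by inverting the inequality as in (21.14).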


21.13 When we leave the case $\sigma_1^2 = \sigma_2^2$, complications arise. The variate
distributed in “Student’s” form, with $n_1+n_2-2$ d.f., by analogy with (21.17), is now

$$t = \frac{d-\delta}{\left(\dfrac{\sigma_1^2}{n_1}+\dfrac{\sigma_2^2}{n_2}\right)^{1/2}} \Bigg/ \left\{\frac{\dfrac{n_1 s_1^2}{\sigma_1^2}+\dfrac{n_2 s_2^2}{\sigma_2^2}}{n_1+n_2-2}\right\}^{1/2}. \tag{21.19}$$

The numerator of (21.19) is a standardized normal variate, and its denominator is the
square root of an independently distributed $\chi^2$ variate divided by its degrees of freedom,
as for (21.17). The difficulty is that (21.19) involves the unknown ratio of variances
$\theta = \sigma_1^2/\sigma_2^2$. If we also define $u = s_1^2/s_2^2$, $N = n_1/n_2$, we may rewrite (21.19) as

$$t = \frac{(d-\delta)(n_1+n_2-2)^{1/2}}{s_2\left\{\left(1+\dfrac{\theta}{N}\right)\left(1+\dfrac{Nu}{\theta}\right)\right\}^{1/2}}, \tag{21.20}$$

which clearly displays its dependence upon the unknown $\theta$. If $\theta = 1$, of course,
(21.20) reduces to (21.18).


21.14 We now have to consider methods by which the “nuisance parameter”, $\theta$,
can be eliminated from interval statements concerning $\delta$. We must clearly seek some
statistic other than $t$ of (21.20). One possibility suggests itself immediately from
inspection of the alternative form, (21.18), to which (21.17) reduces when $\theta = 1$. The
statistic

$$z = \frac{d-\delta}{\left\{\dfrac{s_1^2}{n_1-1}+\dfrac{s_2^2}{n_2-1}\right\}^{1/2}} \tag{21.21}$$

is, like (21.18), the ratio of a normal variate with zero mean to the square root of an
independently distributed unbiassed estimator of its sampling variance. However,
that estimator is not a multiple of a $\chi^2$ variate, and hence $z$ is not distributed in
“Student’s” form. The statistic $z$ is the basis of the fiducial approach and one
approximate confidence interval approach to this problem, as we shall see below.

An alternative possibility is to investigate the distribution of (21.18) itself, i.e. to
see how far the statistic appropriate to the case $\theta = 1$ retains its properties when $\theta \neq 1$.
This, too, has been investigated from the confidence interval standpoint.

However, before proceeding to discuss the approaches outlined in this section, we
examine at some length an exact confidence interval solution to this problem, based
on “Student’s” distribution, and its properties. The results are due to Scheffé
(1943a, 1944).


Exact confidence intervals based on “Student’s” distribution

21.15 If we desire an exact confidence interval for $\delta$ based on the “Student”
distribution, it will be sufficient if we can find a linear function of the observations,
$L$, and a quadratic function of them, $Q$, such that, for all values of $\sigma_1^2, \sigma_2^2$,

(i) $L$ and $Q$ are independently distributed;
(ii) $E(L) = \delta$ and $\operatorname{var} L = V$; and
(iii) $Q/V$ has a $\chi^2$ distribution with $k$ d.f.

Then

$$t = \frac{L-\delta}{(Q/k)^{1/2}} \tag{21.22}$$

has “Student’s” distribution with $k$ d.f. We now prove a remarkable result due to
Scheffé (1944), to the effect that no statistic of the form (21.22) can be a symmetric
function of the observations in each sample; that is to say, $t$ cannot be invariant under
permutation of the first sample members $x_{1i}\ (i = 1, 2, \ldots, n_1)$ among themselves and
of the second sample members $x_{2i}\ (i = 1, 2, \ldots, n_2)$ among themselves.


21.16 Suppose that $t$ is symmetric in the sense indicated. Then we must have

$$\begin{aligned} L &= c_1 \sum_i x_{1i} + c_2 \sum_i x_{2i}, \\ Q &= c_3 \sum_i x_{1i}^2 + c_4 \sum_{i \neq j} x_{1i} x_{1j} + c_5 \sum_i x_{2i}^2 + c_6 \sum_{i \neq j} x_{2i} x_{2j} + c_7 \sum_{i,j} x_{1i} x_{2j}, \end{aligned} \tag{21.23}$$

where the $c$’s are constants independent of the parameters.
Now from (21.22)

$$E(L) = \delta = \mu_1 - \mu_2, \tag{21.24}$$

while from (21.23)

$$E(L) = c_1 n_1 \mu_1 + c_2 n_2 \mu_2. \tag{21.25}$$

(21.24) and (21.25) are identities in $\mu_1$ and $\mu_2$; hence

$$c_1 n_1 \mu_1 = \mu_1, \qquad c_2 n_2 \mu_2 = -\mu_2,$$

so that

$$c_1 = 1/n_1, \qquad c_2 = -1/n_2. \tag{21.26}$$

From (21.26) and (21.23),

$$L = \bar{x}_1 - \bar{x}_2 = d, \tag{21.27}$$

and hence

$$\operatorname{var} L = V = \sigma_1^2/n_1 + \sigma_2^2/n_2. \tag{21.28}$$

Since $Q/V$ has a $\chi^2$ distribution with $k$ d.f.,

$$E(Q/V) = k,$$

so that, using (21.28),

$$E(Q) = k(\sigma_1^2/n_1 + \sigma_2^2/n_2), \tag{21.29}$$

while, from (21.23),

$$E(Q) = c_3 n_1(\sigma_1^2 + \mu_1^2) + c_4 n_1(n_1-1)\mu_1^2 + c_5 n_2(\sigma_2^2 + \mu_2^2) + c_6 n_2(n_2-1)\mu_2^2 + c_7 n_1 n_2 \mu_1 \mu_2. \tag{21.30}$$

Equating (21.29) and (21.30), we obtain expressions for the $c$’s, and thence, from (21.23),

$$Q = k\left\{\frac{s_1^2}{n_1-1} + \frac{s_2^2}{n_2-1}\right\}. \tag{21.31}$$

(21.27) and (21.31) reduce (21.22) to (21.21). Now a linear function of two independent
$\chi^2$ variates can only itself have a $\chi^2$ distribution if it is a simple sum of them, and
$n_1 s_1^2/\sigma_1^2$ and $n_2 s_2^2/\sigma_2^2$ are independent $\chi^2$ variates. Thus, from (21.31), $Q/V$ will only
be a $\chi^2$ variate if

$$\frac{k\sigma_1^2}{n_1(n_1-1)V} = \frac{k\sigma_2^2}{n_2(n_2-1)V} = 1,$$

or

$$\frac{\sigma_1^2}{\sigma_2^2} = \frac{n_1(n_1-1)}{n_2(n_2-1)}. \tag{21.32}$$

Given $n_1, n_2$, this is only true for special values of $\sigma_1^2, \sigma_2^2$. Since we require it to be
true for all values of $\sigma_1^2, \sigma_2^2$ we have established a contradiction. Thus $t$ cannot be
a symmetric function in the sense stated.


21.17 Since we cannot find a symmetric function of the desired type having
“Student’s” distribution, we now consider others. We specialize (21.22) to the
situation where

$$L = \sum_{i=1}^{n_1} d_i/n_1, \qquad Q = \frac{1}{n_1}\sum_{i=1}^{n_1} (d_i - L)^2, \tag{21.33}$$

and the $d_i$ are independent identical normal variates with

$$E(d_i) = \delta, \qquad \operatorname{var} d_i = \sigma^2, \qquad \text{all } i. \tag{21.34}$$

It will be remembered that we have taken $n_1 \le n_2$. (21.22) now becomes

$$t = \frac{L-\delta}{\{Q/(n_1-1)\}^{1/2}} = (L-\delta)\left\{\frac{n_1(n_1-1)}{\sum_i (d_i-L)^2}\right\}^{1/2}, \tag{21.35}$$

which is a “Student” variate with $(n_1-1)$ d.f.
Suppose now that in terms of the original observations

$$d_i = x_{1i} - \sum_{j=1}^{n_2} c_{ij} x_{2j}. \tag{21.36}$$

The $d_i$ are multinormally distributed, since they are linear functions of normal variates
(cf. 15.4). Necessary and sufficient conditions that (21.34) holds are

$$\sum_j c_{ij} = 1, \qquad \sum_j c_{ij}^2 = c^2, \qquad \sum_j c_{ij} c_{kj} = 0,\ \ i \neq k. \tag{21.37}$$

Thus, from (21.36) and (21.37)

$$\operatorname{var} d_i = \sigma^2 = \sigma_1^2 + c^2 \sigma_2^2. \tag{21.38}$$



21.18 The central confidence interval, with confidence coefficient $1-\alpha$, derived
from (21.35) is

$$L - t_{n_1-1,\alpha}\{Q/(n_1-1)\}^{1/2} \le \delta \le L + t_{n_1-1,\alpha}\{Q/(n_1-1)\}^{1/2}, \tag{21.39}$$

where $t_{n_1-1,\alpha}$ is the appropriate deviate for $n_1-1$ d.f. The interval-length $l$ has
expected value, from (21.39),

$$E(l) = 2\, t_{n_1-1,\alpha} \left\{\frac{\sigma^2}{n_1(n_1-1)}\right\}^{1/2} E\left\{\left(\frac{n_1 Q}{\sigma^2}\right)^{1/2}\right\}, \tag{21.40}$$

the last factor on the right being found, from the fact that $n_1 Q/\sigma^2$ has a $\chi^2$ distribution
with $n_1-1$ d.f., to be

$$E\left\{\left(\frac{n_1 Q}{\sigma^2}\right)^{1/2}\right\} = \frac{\sqrt{2}\,\Gamma(\tfrac{1}{2}n_1)}{\Gamma\{\tfrac{1}{2}(n_1-1)\}}. \tag{21.41}$$

To minimize the expected length (21.40), we must minimize $\sigma$, or equivalently,
minimize $c^2$ in (21.38), subject to (21.37). The problem may be visualized geometrically
as follows: consider a space of $n_2$ dimensions, with one axis for each second suffix of
the $c_{ij}$. Then $\sum_j c_{ij} = 1$ is a hyperplane, and $\sum_j c_{ij}^2 = c^2$ is an $n_2$-dimensional
hypersphere which is intersected by the plane in an $(n_2-1)$-dimensional hypersphere. We
require to locate $n_1 \le n_2$ vectors through the origin which touch this latter hypersphere
and (to satisfy the last condition of (21.37)) are mutually orthogonal, in such a way
that the radius of the $n_2$-dimensional hypersphere is minimized. This can be done
by making our vectors coincide with $n_1$ axes, and then $c^2 = 1$. But if $n_1 < n_2$, we
can improve upon this procedure, for we can, while keeping the vectors orthogonal,
space them symmetrically about the equiangular vector, and reduce $c^2$ from 1 to its
minimum value $n_1/n_2$, as we shall now show.


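21.19 Written in vector form, the conditions (21.37) are

$$\mathbf{c}_i \mathbf{u}' = 1, \qquad \mathbf{c}_i \mathbf{c}_k' = \begin{cases} c^2, & i = k, \\ 0, & i \neq k, \end{cases} \tag{21.42}$$

where $\mathbf{c}_i$ is the $i$th row vector of the matrix $\{c_{ij}\}$ and $\mathbf{u}$ is a row vector of units.

If the $n_1$ vectors $\mathbf{c}_i$ satisfy (21.42), we can add another $(n_2-n_1)$ vectors, satisfying
the second (normalizing and orthogonalizing) condition of (21.42), so that the
augmented set forms a basis for an $n_2$-space. We may therefore express $\mathbf{u}$ as a linear
function of the $n_2$ $\mathbf{c}$-vectors,

$$\mathbf{u} = \sum_{k=1}^{n_2} g_k \mathbf{c}_k, \tag{21.43}$$

where the $g_k$ are scalars. Now, using (21.42) and (21.43),

$$1 = \mathbf{c}_i \mathbf{u}' = \mathbf{c}_i \sum_{k=1}^{n_2} g_k \mathbf{c}_k' = \sum_{k=1}^{n_2} g_k \mathbf{c}_i \mathbf{c}_k' = g_i c^2, \qquad i = 1, 2, \ldots, n_1.$$

Thus

$$g_i = 1/c^2, \qquad i = 1, 2, \ldots, n_1. \tag{21.44}$$

Also, since $\mathbf{u}$ is a row vector of units,

$$n_2 = \mathbf{u}\mathbf{u}' = \left(\sum_k g_k \mathbf{c}_k\right)\left(\sum_k g_k \mathbf{c}_k'\right),$$

which, on using (21.42), becomes

$$n_2 = \sum_{k=1}^{n_2} g_k^2\, \mathbf{c}_k \mathbf{c}_k' = c^2\left(\sum_{k=1}^{n_1} g_k^2 + \sum_{k=n_1+1}^{n_2} g_k^2\right). \tag{21.45}$$

Use of (21.44) gives, from (21.45),

$$n_2 = c^2\left\{\frac{n_1}{c^4} + \sum_{k=n_1+1}^{n_2} g_k^2\right\},$$

or

$$n_2 - \frac{n_1}{c^2} = c^2 \sum_{k=n_1+1}^{n_2} g_k^2 \ge 0.$$

Hence

$$c^2 \ge n_1/n_2, \tag{21.46}$$

the required result.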


21.20 The equality sign holds in (21.46) whenever $g_k = 0$ for $k = n_1+1, \ldots, n_2$.
Then the equiangular vector $\mathbf{u}$ lies entirely in the space spanned by the original $n_1$
$\mathbf{c}$-vectors. From (21.44), these will be symmetrically disposed around it. Evidently,
there is an infinite number of ways of determining $c_{ij}$, merely by rotating the set of
$n_1$ vectors. Scheffé (1943a) obtained the particularly appealing solution

$$\begin{aligned} c_{ii} &= (n_1/n_2)^{1/2} - (n_1 n_2)^{-1/2} + 1/n_2, & i &= 1, 2, \ldots, n_1, \\ c_{ij} &= -(n_1 n_2)^{-1/2} + 1/n_2, & j\,(\neq i) &= 1, 2, \ldots, n_1, \\ c_{ij} &= 1/n_2, & j &= n_1+1, \ldots, n_2. \end{aligned} \tag{21.47}$$

It may easily be confirmed that (21.47) satisfies the conditions (21.37) with $c^2 = n_1/n_2$.
Substituted into (21.36), (21.47) gives

$$d_i = x_{1i} - (n_1/n_2)^{1/2} x_{2i} + (n_1 n_2)^{-1/2} \sum_{j=1}^{n_1} x_{2j} - (1/n_2) \sum_{j=1}^{n_2} x_{2j}, \tag{21.48}$$

which yields in (21.33)

$$L = \bar{x}_1 - \bar{x}_2, \qquad Q = \frac{1}{n_1}\sum_{i=1}^{n_1} (u_i - \bar{u})^2, \tag{21.49}$$

where

$$u_i = x_{1i} - (n_1/n_2)^{1/2} x_{2i}, \qquad \bar{u} = \sum_{i=1}^{n_1} u_i/n_1. \tag{21.50}$$

Hence, from (21.35) and (21.48-21.50),

$$t = (\bar{x}_1 - \bar{x}_2 - \delta)\left\{\frac{n_1(n_1-1)}{\sum_{i=1}^{n_1} (u_i-\bar{u})^2}\right\}^{1/2} \tag{21.51}$$

is a “Student’s” variate with $n_1-1$ d.f., and we may proceed to set confidence limits
for $\delta = \mu_1 - \mu_2$.
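
The coefficients (21.47) and the statistic (21.51) are simple to compute. The sketch below is ours: it builds the matrix of the $c_{ij}$, so that the conditions (21.37) can be checked numerically, and evaluates (21.51) using the first $n_1$ members of the second sample as they are supplied. The function names and data are hypothetical, and any random ordering of the second sample is assumed to have been carried out by the caller beforehand.

```python
import math


def scheffe_coefficients(n1, n2):
    """Matrix {c_ij} of (21.47): n1 rows, n2 columns, with n1 <= n2."""
    a = math.sqrt(n1 / n2)
    g = -1.0 / math.sqrt(n1 * n2) + 1.0 / n2
    c = [[1.0 / n2] * n2 for _ in range(n1)]    # columns j > n1
    for i in range(n1):
        for j in range(n1):
            c[i][j] = g                          # columns j <= n1, j != i
        c[i][i] = a + g                          # diagonal element
    return c


def scheffe_t(x1, x2, delta=0.0):
    """Statistic (21.51) and its degrees of freedom n1 - 1."""
    n1, n2 = len(x1), len(x2)
    if n1 > n2:
        raise ValueError("label the samples so that n1 <= n2")
    r = math.sqrt(n1 / n2)
    u = [x1[i] - r * x2[i] for i in range(n1)]   # u_i of (21.50)
    ubar = sum(u) / n1
    ss = sum((ui - ubar) ** 2 for ui in u)
    d = sum(x1) / n1 - sum(x2) / n2              # uses both complete samples
    return (d - delta) * math.sqrt(n1 * (n1 - 1) / ss), n1 - 1


c = scheffe_coefficients(3, 5)
print([sum(row) for row in c])                  # each row sums to 1
print([sum(v * v for v in row) for row in c])   # each equals n1/n2 = 0.6
print(scheffe_t([5.1, 4.8, 6.0], [3.9, 4.4, 4.1, 5.0, 3.6]))
```

Working from the $u_i$ rather than from the $d_i$ of (21.48) gives the same denominator, since the two differ only by a quantity that is constant over $i$.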
21.21 It is rather remarkable that we have been able to find an exact solution of
the confidence interval problem in this case only by abandoning the seemingly natural
requirement of symmetry. (21.51) holds for any randomly selected subset of $n_1$ of
the $n_2$ variates in the second sample. Just as, in 20.22, we resorted to randomization
to remove the difficulty in making exact confidence interval statements about a discrete
variable, so we find here that randomization alone allows us to bypass the nuisance
parameter $\theta$. But the extent of the randomization should not be exaggerated. The
numerator of (21.51) uses the sample means of both samples, complete; only the
denominator varies with different random selections of the subset in the second sample.
It is impossible to assess intuitively how much efficiency is lost by this procedure.
We now proceed to examine the length of the confidence intervals it provides.


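21.22 From (21.38) and (21.46), we have for the optimum solution (21.48),

$$\operatorname{var} d_i = \sigma^2 = \sigma_1^2 + (n_1/n_2)\sigma_2^2. \tag{21.52}$$

Putting (21.52) into (21.40), and using (21.41), we have for the expected length of
the confidence interval

$$E(l) = 2\, t_{n_1-1,\alpha} \left\{\frac{\sigma_1^2/n_1 + \sigma_2^2/n_2}{n_1-1}\right\}^{1/2} \frac{\sqrt{2}\,\Gamma(\tfrac{1}{2}n_1)}{\Gamma\{\tfrac{1}{2}(n_1-1)\}}. \tag{21.53}$$

We now compare this interval $l$ with the interval $L$ obtained from (21.19) if $\theta = \sigma_1^2/\sigma_2^2$
is known. The latter has expected length

$$E(L) = 2\, t_{n_1+n_2-2,\alpha} \left\{\frac{\sigma_1^2/n_1 + \sigma_2^2/n_2}{n_1+n_2-2}\right\}^{1/2} E\left\{\left(\frac{n_1 s_1^2}{\sigma_1^2} + \frac{n_2 s_2^2}{\sigma_2^2}\right)^{1/2}\right\}, \tag{21.54}$$

the last factor being evaluated from the $\chi^2$ distribution with $(n_1+n_2-2)$ d.f. as

$$\frac{\sqrt{2}\,\Gamma\{\tfrac{1}{2}(n_1+n_2-1)\}}{\Gamma\{\tfrac{1}{2}(n_1+n_2-2)\}}. \tag{21.55}$$

(21.53-55) give for the ratio of expected lengths

$$\frac{E(l)}{E(L)} = \frac{t_{n_1-1,\alpha}}{t_{n_1+n_2-2,\alpha}} \left(\frac{n_1+n_2-2}{n_1-1}\right)^{1/2} \frac{\Gamma(\tfrac{1}{2}n_1)\,\Gamma\{\tfrac{1}{2}(n_1+n_2-2)\}}{\Gamma\{\tfrac{1}{2}(n_1-1)\}\,\Gamma\{\tfrac{1}{2}(n_1+n_2-1)\}}. \tag{21.56}$$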
As $n_1 \to \infty$, with $n_2$ fixed, each of the three factors of (21.56) tends to 1, and
therefore the ratio of expected interval length does so, as is intuitively reasonable. For
small $n_1$, the first two factors exceed 1, but the last is less than 1. The following table
gives the exact values of (21.56) for $1-\alpha$ = 0.95, 0.99 and a few sample sizes.


Table of E(l)/E(L) (from Scheffé, 1943a)

                      1−α = 0.95                        1−α = 0.99
  n2−1:         5     10    20    40    ∞         5     10    20    40    ∞
  n1−1
     5        1.15  1.20  1.23  1.25  1.28      1.27  1.36  1.42  1.47  1.52
    10              1.05  1.07   …    1.11             …     …    1.16  1.20
    20                    1.03  1.03  1.05                  1.05  1.06  1.09
    40                          1.01  1.02                        1.02  1.04
     ∞                                1                                 1
Evidently, $l$ is a very efficient interval even for moderate sample sizes, having an
expected length no greater than 11 per cent in excess of that of $L$ for $n_1-1 \ge 10$ at
$1-\alpha$ = 0.95, and no greater than 9 per cent in excess for $n_1-1 \ge 20$ at $1-\alpha$ = 0.99.
Furthermore, we are comparing it with an interval based on knowledge of $\theta$. Taking
this into account, we may fairly say that $l$ puts up a very good performance indeed:
the element of randomization cannot have resulted in very much loss of efficiency.
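
The ratio (21.56) is readily evaluated once the two percentage points of “Student’s” distribution have been read from tables. The sketch below is ours; the function name is hypothetical, and the percentage points are supplied by the user rather than computed.

```python
import math


def expected_length_ratio(n1, n2, t_small, t_large):
    """Ratio E(l)/E(L) of (21.56).

    t_small: percentage point of Student's t with n1 - 1 d.f.
    t_large: percentage point of Student's t with n1 + n2 - 2 d.f.
    """
    k = n1 - 1
    m = n1 + n2 - 2
    # Logarithm of the Gamma-function factor, to avoid overflow for large samples.
    log_gamma = (math.lgamma(n1 / 2.0) + math.lgamma(m / 2.0)
                 - math.lgamma(k / 2.0) - math.lgamma((m + 1) / 2.0))
    return (t_small / t_large) * math.sqrt(m / k) * math.exp(log_gamma)


# n1 - 1 = n2 - 1 = 10 at 1 - alpha = 0.95: tabulated two-sided 95 per cent
# points of t are 2.228 (10 d.f.) and 2.086 (20 d.f.).
print(expected_length_ratio(11, 11, 2.228, 2.086))
```

For this case the computed ratio rounds to 1.05, in agreement with the corresponding entry of the table.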

In addition to this solution to the two-means problem there are also approxi- 
mate confidence-interval solutions, which we shall now summarize. 


Approximate confidence-interval solutions 
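21.23 Welch (1938) has investigated the approximate distribution of the statistic
(21.18), which is a “Student’s” variate when $\sigma_1^2 = \sigma_2^2$, in the case $\sigma_1^2 \neq \sigma_2^2$. In this
case, the sampling variance of its numerator is

$$\operatorname{var}(d-\delta) = \sigma_1^2/n_1 + \sigma_2^2/n_2,$$

so that, writing

$$u = (d-\delta)/(\sigma_1^2/n_1 + \sigma_2^2/n_2)^{1/2}, \qquad w^2 = s^2\left(\frac{1}{n_1}+\frac{1}{n_2}\right) \Bigg/ \left(\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}\right), \tag{21.57}$$

(21.18) may be written

$$y = u/w. \tag{21.58}$$

The difficulty now is that $w^2$, although distributed independently of $u$, is not a multiple
of a $\chi^2$ variate when $\theta \neq 1$. However, by equating its first two moments to those of
a $\chi^2$ variate, we can determine a number of degrees of freedom, $\nu$, for which it is
approximately a $\chi^2$ variate. Its mean and variance are, from (21.57),

$$E(w^2) = b(\nu_1\theta + \nu_2), \qquad \operatorname{var}(w^2) = 2b^2(\nu_1\theta^2 + \nu_2), \tag{21.59}$$

where we have written

$$\nu_1 = n_1-1, \qquad \nu_2 = n_2-1, \qquad b = \left(\frac{1}{n_1}+\frac{1}{n_2}\right) \Bigg/ \left\{(\nu_1+\nu_2)\left(\frac{\theta}{n_1}+\frac{1}{n_2}\right)\right\}. \tag{21.60}$$

If we identify (21.59) with the moments of a multiple $g$ of a $\chi^2$ variate with $\nu$ d.f.,

$$\mu_1' = g\nu, \qquad \mu_2 = 2g^2\nu, \tag{21.61}$$

we find

$$g = b\,\frac{\theta^2\nu_1+\nu_2}{\theta\nu_1+\nu_2}, \qquad \nu = \frac{(\theta\nu_1+\nu_2)^2}{\theta^2\nu_1+\nu_2}. \tag{21.62}$$

With these values of $g$ and $\nu$, $w^2/g$ is approximately a $\chi^2$ variate with $\nu$ degrees of
freedom, and hence, from (21.57),

$$t = u \Big/ \left\{\frac{w^2}{g\nu}\right\}^{1/2} \tag{21.63}$$

is approximately a “Student’s” variate with $\nu$ d.f. If $\theta = 1$, $\nu = \nu_1+\nu_2 = n_1+n_2-2$, $g = b = 1/\nu$,
and (21.63) reduces to (21.18), as it should. But in general, $g$ and $\nu$ depend upon $\theta$.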
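
The moment-matching of (21.59)-(21.62) may be checked numerically. The sketch below is ours; the function name is hypothetical, and the expression used for $b$ is the one obtained by taking expectations in (21.57) with $\theta = \sigma_1^2/\sigma_2^2$, which should be regarded as our reading of (21.60).

```python
def welch_pooled_approx(theta, n1, n2):
    """Return (b, g, nu) of (21.60) and (21.62) for a given variance ratio theta."""
    v1, v2 = n1 - 1, n2 - 1
    # b is our reading of (21.60), derived from E(w^2) in (21.57).
    b = (1.0 / n1 + 1.0 / n2) / ((v1 + v2) * (theta / n1 + 1.0 / n2))
    g = b * (theta ** 2 * v1 + v2) / (theta * v1 + v2)
    nu = (theta * v1 + v2) ** 2 / (theta ** 2 * v1 + v2)
    return b, g, nu


# With theta = 1 the approximation is exact: nu = n1 + n2 - 2 and g = b = 1/nu.
print(welch_pooled_approx(1.0, 6, 9))
# With unequal variances the effective degrees of freedom are reduced.
print(welch_pooled_approx(4.0, 6, 9))
```

For any $\theta$ the returned values satisfy $g\nu = b(\nu_1\theta+\nu_2)$ and $2g^2\nu = 2b^2(\nu_1\theta^2+\nu_2)$, which is the identification (21.61).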


21.24 Welch (1938) investigated the extent to which the assumption that $\theta = 1$
in (21.63), when in reality it takes some other value, leads to erroneous conclusions.
His discussion was couched in terms of testing hypotheses rather than of interval
estimation, which is our present concern, but his conclusion should be briefly mentioned.
He found that, so long as $n_1 = n_2$, no great harm was done by ignorance of the true
value of $\theta$, but that if $n_1 \neq n_2$, serious errors could arise. To overcome this difficulty,
he used exactly the technique of 21.23 to approximate the distribution of the statistic $z$
of (21.21). In this case he found that, whatever the values of $n_1$ and $n_2$, $z$ itself was
approximately distributed in “Student’s” form with

$$\nu = \left(\frac{\theta}{n_1}+\frac{1}{n_2}\right)^2 \Bigg/ \left\{\frac{\theta^2}{n_1^2(n_1-1)}+\frac{1}{n_2^2(n_2-1)}\right\} \tag{21.64}$$

degrees of freedom, and that the influence of a wrongly assumed value of $\theta$ was now
very much smaller. This is what we should expect, since the denominator of $z$ at
(21.21) estimates the variances $\sigma_1^2, \sigma_2^2$ separately, while that of (21.58) uses a “pooled”
estimate $s^2$ which is clearly less appropriate when $\sigma_1^2 \neq \sigma_2^2$.

Lawton (1965) extends earlier work by J. Hajek to obtain close bounds upon the size
and power of the “equal-tails” test based on $z$.

As $\theta \to 0, \infty$, (21.64) $\to n_2-1, n_1-1$ respectively. Mickey and Brown (1966)
show that the distribution of $z$ is bounded by “Student’s” $t$-distributions with $n_1+n_2-2$
and $\min(n_1-1, n_2-1)$ d.f.
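
The behaviour of the approximate degrees of freedom for $z$, in the form familiar from Welch’s approximation and expressed in terms of $\theta = \sigma_1^2/\sigma_2^2$, may be examined with the following sketch. It is ours; the function names and data are hypothetical, and since $\theta$ is unknown in practice the values of $\theta$ used here are purely illustrative.

```python
import math


def welch_df(theta, n1, n2):
    """Approximate degrees of freedom for z, as a function of theta = sigma1^2/sigma2^2."""
    num = (theta / n1 + 1.0 / n2) ** 2
    den = theta ** 2 / (n1 ** 2 * (n1 - 1)) + 1.0 / (n2 ** 2 * (n2 - 1))
    return num / den


def z_statistic(x1, x2, delta=0.0):
    """Statistic z of (21.21); s_i^2 has divisor n_i, so s_i^2/(n_i - 1) = SS_i/{n_i(n_i - 1)}."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    ss1 = sum((x - m1) ** 2 for x in x1)
    ss2 = sum((x - m2) ** 2 for x in x2)
    return (m1 - m2 - delta) / math.sqrt(ss1 / (n1 * (n1 - 1)) + ss2 / (n2 * (n2 - 1)))


for theta in (1e-9, 0.25, 1.0, 4.0, 1e9):
    print(theta, welch_df(theta, 8, 12))
print(z_statistic([1.0, 2.0, 3.0], [2.0, 4.0, 6.0, 8.0]))
```

For $n_1 = 8$, $n_2 = 12$ the printed degrees of freedom run from nearly $n_2-1 = 11$ at very small $\theta$ to nearly $n_1-1 = 7$ at very large $\theta$, and never exceed $n_1+n_2-2 = 18$.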


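21.25 Welch (1947) has refined the approximate approach of the last section.
His argument is a general one, but for the present problem may be summarized as
follows. Defining $s_1^2, s_2^2$ with $n_1-1, n_2-1$ as divisors respectively, so that they are
unbiassed estimators of variances, we seek a statistic $h(s_1^2, s_2^2, P)$ such that

$$P\{(d-\delta) < h(s_1^2, s_2^2, P)\} = P \tag{21.65}$$

whatever the value of $\theta$. Now since $(d-\delta)$ is normally distributed independently of
$s_1^2, s_2^2$, with zero mean and variance $\sigma_1^2/n_1 + \sigma_2^2/n_2 = D^2$, we have

$$P\{(d-\delta) \le h(s_1^2, s_2^2, P) \mid s_1^2, s_2^2\} = I\!\left(\frac{h}{D}\right), \tag{21.66}$$

where $I(x) = \int_{-\infty}^{x} (2\pi)^{-1/2} \exp(-\tfrac{1}{2}t^2)\, dt$. Thus, from (21.65) and (21.66),

$$P = \iint I(h/D)\, f(s_1^2)\, f(s_2^2)\, ds_1^2\, ds_2^2. \tag{21.67}$$

Now we may expand $I(h/D)$, which is a function of $s_1^2, s_2^2$, in a Taylor series about
the true values $\sigma_1^2, \sigma_2^2$. We write this symbolically

$$I\!\left\{\frac{h(s_1^2, s_2^2, P)}{D}\right\} = \exp\left\{\sum_{i=1}^{2} (s_i^2 - \sigma_i^2)\,\partial_i\right\} I\!\left\{\frac{h(s_1^2, s_2^2, P)}{s}\right\}, \tag{21.68}$$

where the operator $\partial_i$ represents differentiation with respect to $s_i^2$, and then putting
$s_i^2 = \sigma_i^2$, and $s^2 = s_1^2/n_1 + s_2^2/n_2$. We may put (21.68) into (21.67) to obtain

$$P = \prod_{i=1}^{2} \left[\int \exp\{(s_i^2 - \sigma_i^2)\,\partial_i\}\, f(s_i^2)\, ds_i^2\right] I\!\left\{\frac{h(s_1^2, s_2^2, P)}{s}\right\}. \tag{21.69}$$

Now since we have

$$f(s_i^2)\, ds_i^2 = \frac{1}{\Gamma(\tfrac{1}{2}\nu_i)} \left(\frac{\nu_i s_i^2}{2\sigma_i^2}\right)^{\frac{1}{2}\nu_i - 1} \exp\left(-\frac{\nu_i s_i^2}{2\sigma_i^2}\right) d\!\left(\frac{\nu_i s_i^2}{2\sigma_i^2}\right),$$

on carrying out each integration in the symbolic expression (21.69) we find

$$\int \exp\{(s_i^2 - \sigma_i^2)\,\partial_i\}\, f(s_i^2)\, ds_i^2 = \left(1 - \frac{2\sigma_i^2 \partial_i}{\nu_i}\right)^{-\frac{1}{2}\nu_i} \exp(-\sigma_i^2 \partial_i),$$

which, put into (21.69), gives

$$P = \prod_{i=1}^{2} \left[\left(1 - \frac{2\sigma_i^2 \partial_i}{\nu_i}\right)^{-\frac{1}{2}\nu_i} \exp(-\sigma_i^2 \partial_i)\right] I\!\left\{\frac{h(s_1^2, s_2^2, P)}{s}\right\}. \tag{21.70}$$

We can solve (21.70) to obtain the form of the function $h$, and hence find $h(s_1^2, s_2^2, P)$,
for any known $P$.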
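Welch gave a series expansion for $h$, which in our special case becomes

$$ h(s_1^2, s_2^2, P) = s\left[\xi + \frac{\xi(1+\xi^2)}{4}\sum_{i=1}^{2}\frac{c_i^2}{\nu_i} - \frac{\xi(1+\xi^2)}{2}\sum_{i=1}^{2}\frac{c_i^2}{\nu_i^2} + \cdots\right], \tag{21.71} $$

where $c_i = \dfrac{s_i^2}{n_i}\Big/\left(\dfrac{s_1^2}{n_1}+\dfrac{s_2^2}{n_2}\right)$, $\nu_i = n_i-1$ and $\xi$ is defined by $I(\xi) = P$.

Since $(d-\delta)/s = z$ of (21.21), (21.71) gives the distribution function of $z$.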
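
The leading terms of such an expansion are easy to evaluate. The sketch below is illustrative only: it assumes the first two terms of Welch's series, $h/s \approx \xi + \tfrac{1}{4}\xi(1+\xi^2)\sum_i c_i^2/\nu_i$, with $c_i = (s_i^2/n_i)/(s_1^2/n_1+s_2^2/n_2)$ and the $s_i^2$ computed with divisors $n_i-1$. The function name `welch_h_over_s` is hypothetical, and the result is no substitute for the published tables.

```python
from statistics import NormalDist


def welch_h_over_s(P: float, s1_sq: float, s2_sq: float, n1: int, n2: int) -> float:
    """Two-term approximation to h/s: xi + xi(1 + xi^2)/4 * sum(c_i^2 / nu_i).

    s1_sq and s2_sq are assumed to be the unbiased variance estimates
    (divisors n1-1 and n2-1); terms of order 1/nu^2 and beyond are ignored.
    """
    xi = NormalDist().inv_cdf(P)          # normal deviate with I(xi) = P
    a1, a2 = s1_sq / n1, s2_sq / n2       # estimated variances of the two means
    c1, c2 = a1 / (a1 + a2), a2 / (a1 + a2)
    nu1, nu2 = n1 - 1, n2 - 1
    return xi + xi * (1.0 + xi ** 2) / 4.0 * (c1 ** 2 / nu1 + c2 ** 2 / nu2)


if __name__ == "__main__":
    # Equal sample sizes and equal estimated variances: the two-term value is
    # a little below the "Student's" point for 20 d.f. (about 2.086).
    print(round(welch_h_over_s(0.975, 1.0, 1.0, 11, 11), 4))
```

When all the weight falls on one sample ($c_1 = 1$, $c_2 = 0$) the expression reduces to the familiar first correction to the normal deviate for a “Student’s” variate with $\nu_1$ d.f.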

Following further work by Welch, Aspin (1948, 1949) and Trickett et al. (1956),
(21.71) has now been tabled as a function of $\nu_1$, $\nu_2$ and $c_1$, for $P$ = 0.95, 0.975, 0.99
and 0.995. These tables enable us to set central confidence limits for $\delta$ with $1-\alpha$
= 0.90, 0.95, 0.98 and 0.99. Some of the tables are reproduced as Table 11 of the
Biometrika Tables.

Asymptotic expressions of the type (21.71) have been justified by Chernoff (1949)
and D. L. Wallace (1958). (21.71) is asymptotic in the sense that each succeeding
term on the right is an order lower in $\nu_i$.

Wald (1955) carried the Welch approach much further for the case $n_1 = n_2$.

Press (1966) shows that for $1-\alpha$ = 0.90, $n_1 < n_2 < 30$, Welch’s intervals have
smaller expected length than (21.53) if $\theta$ is small (when (21.51) discards information
about the more variable population) but not if $\theta$ is large. The two sets of intervals
are shown to be asymptotically equivalent, and never differ by more than 10 per cent
in expected length when $n_1 > 10$ if $0.01 < \theta < 100$.


The fiducial solution
21.26 The fiducial solution of the two-means problem starts from the joint
distribution of sample means and variances, which may be written

$$ dF \propto \frac{1}{\sigma_1\sigma_2}\exp\left\{-\frac{n_1}{2\sigma_1^2}(\bar x_1-\mu_1)^2-\frac{n_2}{2\sigma_2^2}(\bar x_2-\mu_2)^2\right\} d\bar x_1\, d\bar x_2 \times \frac{s_1^{n_1-2}\, s_2^{n_2-2}}{\sigma_1^{n_1-1}\sigma_2^{n_2-1}}\exp\left\{-\frac{n_1 s_1^2}{2\sigma_1^2}-\frac{n_2 s_2^2}{2\sigma_2^2}\right\} ds_1\, ds_2. \tag{21.72} $$

In accordance with the fiducial argument, we replace $d\bar x_1$, $d\bar x_2$ by $d\mu_1$, $d\mu_2$ and $ds_1/s_1$,
$ds_2/s_2$ by $d\sigma_1/\sigma_1$, $d\sigma_2/\sigma_2$, as in 21.10. Then for the fiducial distribution (omitting
powers of $s_1$ and $s_2$, which are now constants) we have

$$ dF \propto \frac{1}{\sigma_1\sigma_2}\exp\left\{-\frac{n_1}{2\sigma_1^2}(\bar x_1-\mu_1)^2-\frac{n_2}{2\sigma_2^2}(\bar x_2-\mu_2)^2\right\} d\mu_1\, d\mu_2 \times \frac{1}{\sigma_1^{n_1}\sigma_2^{n_2}}\exp\left\{-\frac{n_1 s_1^2}{2\sigma_1^2}-\frac{n_2 s_2^2}{2\sigma_2^2}\right\} d\sigma_1\, d\sigma_2. \tag{21.73} $$

Writing

$$ t_1 = \frac{(\mu_1-\bar x_1)\sqrt{(n_1-1)}}{s_1}, \qquad t_2 = \frac{(\mu_2-\bar x_2)\sqrt{(n_2-1)}}{s_2}, \tag{21.74} $$

we find, as in 21.10, the joint distribution of $\mu_1$ and $\mu_2$

$$ dF \propto \frac{d_{\mu_1}t_1\; d_{\mu_2}t_2}{\{1+t_1^2/(n_1-1)\}^{\frac{1}{2}n_1}\,\{1+t_2^2/(n_2-1)\}^{\frac{1}{2}n_2}}, \tag{21.75} $$

where we write $d_{\mu_1}t_1$ to remind ourselves that the differential element is $\sqrt{(n_1-1)}\,d\mu_1/s_1$
and similarly for the second sample.

We cannot proceed at once to find an interval for $\delta = \mu_1-\mu_2$. In fact, from
(21.74) we have

$$ (\mu_1-\bar x_1)-(\mu_2-\bar x_2) = \delta-d = t_1 s_1/\sqrt{(n_1-1)} - t_2 s_2/\sqrt{(n_2-1)}, \tag{21.76} $$

and to set limits to $\delta$ we require the fiducial distribution of the right-hand side of (21.76)
or some convenient function of it. This is a linear function of $t_1$ and $t_2$, whose fiducial
distribution is given by (21.75). In actual fact Fisher (1935b, 1939), following Behrens,
chose the statistic (21.21) as the most convenient function. We have

$$ z = t_1\cos\psi - t_2\sin\psi, \tag{21.77} $$

where

$$ \tan^2\psi = \frac{s_2^2}{n_2-1}\bigg/\frac{s_1^2}{n_1-1}. \tag{21.78} $$

For given $\psi$ the distribution of $z$ (usually known as the Fisher-Behrens distribution)
can be found from (21.76). It has no simple form, but Fisher and Yates’ Statistical
Tables give tables of significance points for $z$ with assigned values of $n_1$, $n_2$, $\psi$, and the
probability $1-\alpha$. In using these tables (and in consulting Fisher’s papers generally)
the reader should note that our $s^2/(n-1)$ is written by him as $s'^2$.


21.27 In this case, the most important yet noticed, the fiducial argument does not
give the same result as the approach from confidence intervals. That is to say, if we
determine from a probability $1-\alpha$ the corresponding points of $z$, say $z_0$ and $z_1$, and
then assert

$$ \bar x_1-\bar x_2 - z_0\sqrt{\frac{s_1^2}{n_1-1}+\frac{s_2^2}{n_2-1}} \;\le\; \mu_1-\mu_2 \;\le\; \bar x_1-\bar x_2 + z_1\sqrt{\frac{s_1^2}{n_1-1}+\frac{s_2^2}{n_2-1}}, \tag{21.79} $$

we shall not be correct in a proportion $1-\alpha$ of cases in the long run, as is obvious
from the fact that $z$ may be expressed as


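$$ z = t\left\{\frac{(n_1-1)(n_2-1)}{(n_2-1)s_1^2+(n_1-1)s_2^2}\cdot\frac{(\theta/n_1+1/n_2)\,(n_1 s_1^2/\theta+n_2 s_2^2)}{n_1+n_2-2}\right\}^{1/2}, $$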
where $t$, defined by (21.20), has an exact “Student’s” distribution. Since $t$ is
distributed independently of $\theta$, $z$ cannot be.

This fact has been made the ground of criticism by adherents of the confidence-
interval approach. The fiducialist reply is that there is no particular reason why such
statements should be correct in a proportion of cases in the long run; and that to
impose such a desideratum is to miss the point of the fiducial approach. We return
to this point later.


Bayesian intervals

21.28 We proceed now to consider the relation between fiducial theory and interval
estimation based on Bayes’ theorem, as developed by Jeffreys (1948). The theorem
(8.2-3) states that the probability of $q_r$ on data $p$ and $H$ is proportional to the product
of the probability of $q_r$ on $H$ and the probability of $p$ on $q_r$ and $H$. Symbolically

$$ P\{q_r \mid p, H\} \propto P(q_r \mid H)\,P(p \mid q_r, H). \tag{21.80} $$

From the Bayesian viewpoint, we take $q_r$ to be a value of the parameter $\theta$ under estimate
and $P(q_r \mid H)$ as its prior distribution in probability. $P\{q_r \mid p, H\}$ then becomes the
posterior probability distribution of $\theta$ and we can use it to set limits within which $\theta$
lies, to assigned degrees of probability in this sense.

The major problem, as we have noted earlier, is to assign values to the prior
distribution $P(q_r \mid H)$. Jeffreys has extended Bayes’ postulate (which stated that if nothing
is known about $\theta$ and its range is finite, the prior distribution should be proportional
to $d\theta$) to take account of various situations. In particular, (1) if the range of $\theta$ is infinite
in both directions the prior probability is still taken as proportional to $d\theta$; (2) if $\theta$
ranges from 0 to $\infty$ the prior distribution is taken as proportional to $d\theta/\theta$.


Example 21.3
In the case of the normal distribution considered in 21.2 we have, with $\bar x$ sufficient
for $\mu$,

$$ P(\bar x \mid \mu, H) = \left(\frac{n}{2\pi}\right)^{1/2}\exp\left\{-\frac{n}{2}(\bar x-\mu)^2\right\}, \tag{21.81} $$

and if $\mu$ can lie anywhere in $(-\infty, +\infty)$, the prior distribution is taken as

$$ P(d\mu \mid H) \propto d\mu. \tag{21.82} $$

Hence, for the posterior distribution of $\mu$,

$$ P(d\mu \mid \bar x, H) \propto \left(\frac{n}{2\pi}\right)^{1/2}\exp\left\{-\frac{n}{2}(\mu-\bar x)^2\right\} d\mu. \tag{21.83} $$

Integration over the range of $\mu$ from $-\infty$ to $\infty$ shows that the proportionality is in
fact an equality. Thus we may, for any given level of probability, determine the
range of $\mu$. This is, in fact, the same as that given by confidence-interval theory or
fiducial theory.
On the other hand, for the distribution (21.6) of Example 21.2 we take the prior
distribution of $\theta$, which is in $(0, \infty)$, to be

$$ P(d\theta \mid H) \propto d\theta/\theta. \tag{21.84} $$
The essential similarity to the fiducial procedure in Example 21.2 will be evident. We 


also have 
B B ees pee d0 
Evaluation of the constant, required to make the integral from $\theta = 0$ to $\infty$ equal
to unity, gives
B tP e—6t/ 6 


P(d0|t,H) = (5) aT 


which again is the same distribution as the one obtained by the confidence-interval 
and the fiducial approaches. 


(21.86) 
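
For the normal case in the first part of this example, the agreement between the posterior limits and the confidence limits can be verified directly. The following is a small sketch, assuming unit variance as in (21.81) and a uniform prior for $\mu$; the function names are ours, and the midpoint-rule integration is only a numerical check that the right-hand side of (21.83) integrates to unity.

```python
import math
from statistics import NormalDist


def posterior_limits(xbar: float, n: int, level: float = 0.95):
    """Central limits for mu from the posterior N(xbar, 1/n) under a uniform prior."""
    posterior = NormalDist(mu=xbar, sigma=1.0 / math.sqrt(n))
    tail = (1.0 - level) / 2.0
    return posterior.inv_cdf(tail), posterior.inv_cdf(1.0 - tail)


def confidence_limits(xbar: float, n: int, level: float = 0.95):
    """Usual central confidence limits xbar -/+ z / sqrt(n) for unit variance."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z / math.sqrt(n)
    return xbar - half_width, xbar + half_width


def posterior_mass(xbar: float, n: int, width: float = 12.0, steps: int = 20000) -> float:
    """Midpoint-rule integral over mu of (n/2pi)^(1/2) exp{-n (mu - xbar)^2 / 2}."""
    sd = 1.0 / math.sqrt(n)
    lower = xbar - width * sd
    h = 2.0 * width * sd / steps
    const = math.sqrt(n / (2.0 * math.pi))
    total = 0.0
    for k in range(steps):
        mu = lower + (k + 0.5) * h
        total += const * math.exp(-0.5 * n * (mu - xbar) ** 2) * h
    return total


if __name__ == "__main__":
    print(posterior_limits(0.3, 25), confidence_limits(0.3, 25))
    print(posterior_mass(0.3, 25))
```

The two pairs of limits coincide numerically, although they are interpreted quite differently in the two theories.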


21.29 Let us now consider the case of setting limits to the mean in normal samples
when the variance is unknown. For the “Student” distribution we have

$$ P(dt \mid \mu, \sigma, H) = \frac{k\,dt}{(1+t^2/\nu)^{\frac{1}{2}(\nu+1)}}, \tag{21.87} $$

where $k$ is some constant and $\nu = n-1$. The parameters $\mu$ and $\sigma$ do not appear on
the right and hence are irrelevant to $P(dt \mid H)$ and may be suppressed. Thus

$$ P(dt \mid H) = \frac{k\,dt}{(1+t^2/\nu)^{\frac{1}{2}(\nu+1)}}. \tag{21.88} $$

Suppose now that we assume that

$$ P(dt \mid \bar x, s, H) = f(t)\,dt. \tag{21.89} $$

Then, as before, $\bar x$ and $s$ may be suppressed, and we have

$$ P(dt \mid H) = f(t)\,dt, \tag{21.90} $$

and hence, by comparison with (21.88),

$$ P(dt \mid \bar x, s, H) = \frac{k\,dt}{(1+t^2/\nu)^{\frac{1}{2}(\nu+1)}}. \tag{21.91} $$

We can then proceed to find limits to $t$, given $\bar x$ and $s$, in the usual way. Jeffreys
emphasizes, however, that this depends on a new postulate expressed by (21.89) which,
though natural, is not trivial. It amounts to an assumption that if we are comparing
different distributions, samples from which give different $\bar x$’s and $s$’s, the scale of the
distribution of $\mu$ must be taken proportional to $s$ and its mean displaced by the
difference of sample means.


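21.30 In a similar way it will be found that to arrive at the Fisher-Behrens
distribution it is necessary to postulate that

$$ P\{dt_1, dt_2 \mid \bar x_1, \bar x_2, s_1, s_2, H\} = f_1(t_1)\, f_2(t_2)\, dt_1\, dt_2. \tag{21.92} $$

Jeffreys’ derivation of the Fisher-Behrens form from Bayes’ theorem would be as
follows:
The prior probability of $d\mu_1\, d\mu_2\, d\sigma_1\, d\sigma_2 \mid H$ is

$$ P(d\mu_1\, d\mu_2\, d\sigma_1\, d\sigma_2 \mid H) \propto \frac{d\mu_1\, d\mu_2\, d\sigma_1\, d\sigma_2}{\sigma_1\sigma_2}. $$

The likelihood (denoting the data by $D$) is

$$ P\{D \mid \mu_1, \mu_2, \sigma_1, \sigma_2, H\} \propto \frac{1}{\sigma_1^{n_1}\sigma_2^{n_2}} \exp\left[-\frac{n_1}{2\sigma_1^2}\{(\bar x_1-\mu_1)^2+s_1^2\} - \frac{n_2}{2\sigma_2^2}\{(\bar x_2-\mu_2)^2+s_2^2\}\right]. $$

Hence, by Bayes’ theorem,

$$ P\{d\mu_1\, d\mu_2\, d\sigma_1\, d\sigma_2 \mid D, H\} \propto \frac{1}{\sigma_1^{n_1+1}\sigma_2^{n_2+1}} \exp\left[-\frac{n_1}{2\sigma_1^2}\{(\mu_1-\bar x_1)^2+s_1^2\} - \frac{n_2}{2\sigma_2^2}\{(\mu_2-\bar x_2)^2+s_2^2\}\right] d\mu_1\, d\mu_2\, d\sigma_1\, d\sigma_2. $$

Integrating out the values of $\sigma_1$ and $\sigma_2$, we find for the posterior distribution of $\mu_1$ and
$\mu_2$ a form which is easily reducible to (21.75).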
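
The integration over each $\sigma$ can be checked numerically. The sketch below is a hedged illustration: it assumes a single-sample posterior factor proportional to $\sigma^{-(n+1)}\exp[-n\{(\mu-\bar x)^2+s^2\}/(2\sigma^2)]$ and confirms that, after integrating over $\sigma$, the dependence on $\mu$ is through $\{1+t^2/(n-1)\}^{-n/2}$ with $t = (\mu-\bar x)\sqrt{(n-1)}/s$. The function names, grid limits and numerical values are ours.

```python
import math


def sigma_integral(mu: float, xbar: float, s: float, n: int, steps: int = 4000) -> float:
    """Integrate sigma^-(n+1) exp(-n[(mu-xbar)^2 + s^2] / (2 sigma^2)) over sigma > 0.

    Uses the midpoint rule on a logarithmic grid (sigma = e^u, d sigma = sigma du).
    """
    a = (mu - xbar) ** 2 + s ** 2
    u_low, u_high = math.log(1e-3 * s), math.log(1e3 * s)
    h = (u_high - u_low) / steps
    total = 0.0
    for k in range(steps):
        sigma = math.exp(u_low + (k + 0.5) * h)
        total += sigma ** (-(n + 1)) * math.exp(-n * a / (2.0 * sigma ** 2)) * sigma * h
    return total


def student_kernel(mu: float, xbar: float, s: float, n: int) -> float:
    """The factor {1 + t^2/(n-1)}^(-n/2) with t = (mu - xbar) sqrt(n-1) / s."""
    t = (mu - xbar) * math.sqrt(n - 1) / s
    return (1.0 + t * t / (n - 1)) ** (-n / 2.0)


if __name__ == "__main__":
    xbar, s, n = 0.0, 1.3, 10
    for mu in (0.4, 0.9, 2.0):
        ratio = sigma_integral(mu, xbar, s, n) / sigma_integral(xbar, xbar, s, n)
        print(mu, ratio, student_kernel(mu, xbar, s, n))
```

The product of two such factors, one for each sample, is the form (21.75).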


Discussion 


21.31 There has been so much controversy about the various methods of estima- 
tion we have described that, at this point, we shall have to leave our customary objective 
standpoint and descend into the arena ourselves. The remainder of this chapter is 
an expression of personal views. We think that it is the correct viewpoint ; and it 
represents the result of many years’ silent reflexion on the issues involved, a serious 
attempt to understand what the protagonists say, and an even more serious attempt to 
divine what they mean. Whether it will command their approval is more than we 
can conjecture. 


21.32 We have, then, to examine three methods of approach: confidence inter- 
vals, fiducial intervals and Bayesian intervals. We must not be misled by, though we 
may derive some comfort from, the similarity of the results to which they lead in certain 
simple cases. We shall, however, develop the thesis that, where they differ, the basic 
reason is not that one or more are wrong, but that they are consciously or unconsciously 
either answering different questions or resting on different postulates. 


21.33 It will be simplest if we begin with the Bayes—Jeffreys approach. If it be 
granted that probability is a measure of belief, or an undefined idea obeying the usual 
postulates, Bayes’ theorem is unexceptionable. We have to recognize, however, that 
by abandoning an attempt to base our probability theory on frequencies of events, we 
have lost something in objectivity. 

The second hurdle to be taken is the acceptance of rules expressing prior
probability distributions. Jeffreys has very persuasively argued for the rules referred to
above and nothing better has been proposed. At the same time there seems to be
something arbitrary, for example, in requiring the prior distribution of a parameter
which may range from $-\infty$ to $+\infty$ to be $d\mu$, whereas if it varies only over the range
0 to $\infty$ it should have a prior distribution proportional to $d\mu/\mu$. Sophisticated
arguments concerning the distinction somehow fail to impress us as touching the root of
the problem.


21.34 It should also be noticed that we have applied the Bayes argument to cases
where a set of sufficient statistics exists. This is not essential. If $L$ is the Likelihood
Function, we can always write

$$ P\{\theta \mid x_1, \ldots, x_n, H\} \propto P(\theta \mid H)\, L(x_1, \ldots, x_n \mid \theta, H) \tag{21.93} $$

and, given $P(\theta \mid H)$, determine the posterior distribution of $\theta$. From this viewpoint,
the only advantage of a small set of sufficient statistics is that it summarizes all the
relevant information in the Likelihood Function in fewer statistics than the $n$ sample values.
As we have remarked previously, these sample values themselves always constitute
a set of sufficient statistics, though in practice this may be only a comforting tautology.


21.35 Confidence-interval theory is also general in the sense that the existence 
of a single sufficient statistic for the unknown parameter is a convenience, not a neces- 
sity. We have, however, noted that where no single sufficient statistic exists there 
may be imaginary or otherwise nugatory intervals in some cases at least, and we know of 
no case in which these difficulties appear in the presence of single sufficiency, so that 
confidence-interval theory is possibly not so free from the need for sufficiency as might 
appear; but perhaps it would be better to say that where nested and simply connected
intervals cannot be obtained, there are special difficulties of interpretation.

The principal argument in favour of confidence intervals, however, is that they 
can be derived in terms of a frequency theory of probability without any assumptions 
concerning prior distributions such as are essential to the Bayes approach. ‘This, in 
our opinion, is undeniable. But it is fair to ask whether they achieve this economy 
of basic assumption without losing something which the Bayes theory possesses. Our 
view is that they do, in fact, lose something on occasion, and that this something may 
be important for the purposes of estimation. 


21.36 Consider the case where we are estimating the mean $\mu$ of a normal
population with known variance (taken as unity). And let us suppose that we know that $\mu$
lies between 0 and 1. According to Bayes’ postulate, we should have

$$ P(d\mu \mid \bar x) = \frac{\exp\left\{-\dfrac{n}{2}(\mu-\bar x)^2\right\} d\mu}{\displaystyle\int_0^1 \exp\left\{-\dfrac{n}{2}(\mu-\bar x)^2\right\} d\mu} \tag{21.94} $$

and the problem of setting limits to $\mu$, though not free from mathematical complexity,
is determinate. What has confidence-interval theory to say on this point? It can
do no more than reiterate statements like

$$ P\{\bar x-1.96/\sqrt{n} \le \mu \le \bar x+1.96/\sqrt{n}\} = 0.95. $$

These are still true in the required proportion of cases, but the statements take no
account of our prior knowledge about the range of $\mu$ and may occasionally be idle.
It may be true, but would be absurd, to assert $-1 \le \mu \le 2$ if we know already that
$0 \le \mu \le 1$. Of course, we may truncate our interval to accord with the prior
information. In our example, we could assert only that $0 \le \mu \le 1$: the observations would
have added nothing to our knowledge.

In fact, so it seems to us, confidence-interval theory has the defect of its principal
virtue: it attains its generality at the price of being unable to incorporate prior
knowledge into its statements. When we make our final judgment about $\mu$, we have to
synthesize the information obtained from the observations with our prior knowledge.
Bayes’ theorem attempts this synthesis at the outset. Confidence theory leaves it until
the end (and, we feel bound to remark, in most current expositions ignores the point
completely).
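
The contrast can be made concrete with a small computation. The sketch below assumes unit variance and a uniform prior on $(0, 1)$, so that the posterior is a normal distribution with mean $\bar x$ and variance $1/n$ truncated to that range, and sets central limits inside it. The function names and the numerical values are illustrative only.

```python
import math
from statistics import NormalDist


def truncated_posterior_limits(xbar, n, lower=0.0, upper=1.0, level=0.95):
    """Central limits for mu from N(xbar, 1/n) truncated to [lower, upper]."""
    untruncated = NormalDist(mu=xbar, sigma=1.0 / math.sqrt(n))
    f_low, f_high = untruncated.cdf(lower), untruncated.cdf(upper)
    mass = f_high - f_low                      # denominator of the truncated density
    tail = (1.0 - level) / 2.0
    q_low = untruncated.inv_cdf(f_low + tail * mass)
    q_high = untruncated.inv_cdf(f_low + (1.0 - tail) * mass)
    return q_low, q_high


def confidence_limits(xbar, n, level=0.95):
    """Usual central confidence limits, which ignore the prior range of mu."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z / math.sqrt(n)
    return xbar - half_width, xbar + half_width


if __name__ == "__main__":
    xbar, n = -0.2, 4                              # a sample mean outside the known range
    print(confidence_limits(xbar, n))              # extends well below zero
    print(truncated_posterior_limits(xbar, n))     # stays inside (0, 1)
```

With a sample mean far inside the range and a large sample the two intervals practically coincide; they differ materially only when the observations fall near or beyond the boundary of the permitted range.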


21.37 Fiducial theory, as we have remarked, has been confined by Fisher to the
case where sufficient statistics are used, or, quite generally, to cases where all the
information in the Likelihood Function can be utilized. No systematic exposition has
been given of the procedure to be followed when prior information is available, but there
seems no reason why a similar method to that exemplified by equation (21.94) should
not be used. That is to say, if we derive the fiducial distribution $f(\mu)$ over a general
range but have the supplementary information that the parameter must lie in the range
$\mu_0$ to $\mu_1$ (within that general range), we modify the fiducial distribution by truncation to

$$ f(\mu) \bigg/ \int_{\mu_0}^{\mu_1} f(\mu)\, d\mu. $$


21.38 One critical difficulty of fiducial theory is exemplified by the derivation of 
“ Student’s ”’ distribution in fiducial form given in 21.10. It appears to us that this 
particular matter has been widely misunderstood, except by Jeffreys. Since the 
“Student ” distribution gives the same result for fiducial theory as for confidence 
theory, whereas the two methods differ on the problem of two means, both sides seem 
to have sought for their basic differences in the second, not in the first. But in our 
view c’est le premier test qui coûte. If the logic of this is agreed, the more general
Fisher-Behrens result follows by a very simple extension. This is also evident from the
Bayes-Jeffreys approach, in which (21.92) is an obvious extension of (21.90) for two inde-
pendent samples. 

The question, as noted in 21.10, is whether, given the joint distribution of $\bar x$ and $s$
(which are independent in the ordinary sense), we can replace $d\bar x\, ds$ by $d\mu\, d\sigma/\sigma$. It
appears to us that this is not obvious and, indeed, requires a new postulate, just as 
(21.90) requires a new postulate. On this point, the paper by Yates (1939a) is explicit. 
A penetrating general discussion of the fiducial argument is given by Dempster (1964) 
and in the Symposium published in the Bulletin of the International Statistical Institute 
(1964, 40(2), pp. 833-939). 


Paradoxes and restrictions in fiducial theory 


21.39 If $(t_1, t_2)$ are jointly sufficient for $(\theta_1, \theta_2)$, we may write the alternative
factorizations

$$ \begin{aligned} L(x \mid \theta_1, \theta_2) \propto g(t_1, t_2 \mid \theta_1, \theta_2) &= g_1(t_1 \mid t_2, \theta_1, \theta_2)\, g_2(t_2 \mid \theta_1, \theta_2) \\ &= g_3(t_2 \mid t_1, \theta_1, \theta_2)\, g_4(t_1 \mid \theta_1, \theta_2). \end{aligned} $$

If $t_1$ and $t_2$ each depend on only one of the parameters, there is no difficulty, each
statistic being singly sufficient for its parameter.
More generally, we may distinguish two special structures of the sufficient statistics:
(a) One of the statistics depends on only one parameter. The factorization becomes

$$ \left.\begin{aligned} \text{either}\quad & L \propto g_1(t_1 \mid t_2, \theta_1, \theta_2)\, g_2(t_2 \mid \theta_2) \\ \text{or}\quad & L \propto g_3(t_2 \mid t_1, \theta_1, \theta_2)\, g_4(t_1 \mid \theta_1) \end{aligned}\right\} \tag{21.95} $$

(b) One of the conditional distributions depends on only one parameter, giving

$$ \left.\begin{aligned} \text{either}\quad & L \propto g_1(t_1 \mid t_2, \theta_1)\, g_2(t_2 \mid \theta_1, \theta_2) \\ \text{or}\quad & L \propto g_3(t_2 \mid t_1, \theta_2)\, g_4(t_1 \mid \theta_1, \theta_2) \end{aligned}\right\} \tag{21.96} $$

If the first line of (21.96) holds, $t_2$ is singly sufficient for $\theta_2$ when $\theta_1$ is known; if
the second line holds, $t_1$ is singly sufficient for $\theta_1$ when $\theta_2$ is known.

Either line of (21.95) or of (21.96) permits a joint fiducial distribution to be con- 
structed by first obtaining the fiducial distribution of one parameter from the factor 
in which it appears alone, and then obtaining the conditional fiducial distribution 
of the other parameter (the value of the first parameter being fixed) from the factor 
in which both parameters appear. ‘The product of these distributions is taken as the 
joint fiducial distribution, on the analogy of the multiplication theorem for probabilities. 
(21.95) and (21.96) were used in this way by Fisher (1956) and Quenouille (1958) 
respectively. 

It will be seen that in 21.10 and 21.26, both (21.95) and (21.96) held; this was
possible because the sample mean (or difference of means) $t_1$ was distributed
independently of the sample variance(s) $t_2$. In general, however, even these special
sufficiency structures are not enough to guarantee the uniqueness of the joint fiducial
distribution, as Tukey (1957) and Brillinger (1962) showed by counter-examples.
The non-uniqueness arises precisely because both lines of (21.95) (or of (21.96)) can
hold simultaneously, and the joint fiducial distribution may depend on which line we
use to construct it. See also Mauldon (1955) and Dempster (1963).

We shall see in 23.37-9 that either of the sufficiency structures (21.95-6) ensures 
optimum properties for conditional tests under a certain condition. 

Fraser (1961a, b) discusses the relationship of fiducial inference and some invariance 
properties. See also Hora and Buehler (1966). 


21.40 Lindley (1958a) has obtained a simple yet far-reaching result which not
only illuminates the relationship between fiducial and Bayesian arguments, but also
limits the claims of fiducial theory to provide a general method of inference, consistent
with and combinable with Bayesian methods. In fact, Lindley shows that the fiducial
argument is consistent with Bayesian methods if and only if it is applied to a random
variable $x$ and a parameter $\theta$ which may be (separately) transformed to $u$ and $\tau$
respectively so that $\tau$ is a location parameter for $u$; and in this case, it is equivalent to a Bayesian
argument with a uniform prior distribution for $\tau$. The criticism applies equally to
“confidence distributions” as defined at the end of 20.6 above, in so far as they
coincide with fiducial distributions.


21.41 Using (21.5), we write for the fiducial distribution of $\theta$ (without confusion
with the usual notation for the characteristic function)

$$ \phi_x(\theta) = -\frac{\partial}{\partial\theta}F(x \mid \theta), \tag{21.97} $$

while the posterior distribution for $\theta$ given a prior distribution $p(\theta)$ is, by Bayes’
theorem,

$$ \pi_x(\theta) = p(\theta)\, f(x \mid \theta) \bigg/ \int p(\theta)\, f(x \mid \theta)\, d\theta, \tag{21.98} $$

where $f(x \mid \theta) = \partial F(x \mid \theta)/\partial x$, the frequency function. Writing $r(x)$ for the
denominator on the right of (21.98), we thus have

$$ \pi_x(\theta) = \frac{p(\theta)}{r(x)}\,\frac{\partial F(x \mid \theta)}{\partial x}. \tag{21.99} $$

If there is some prior distribution $p(\theta)$ for which the fiducial distribution is equivalent
to a Bayes posterior distribution, (21.97) and (21.99) will be equal, or

$$ \frac{-\dfrac{\partial}{\partial\theta}F(x \mid \theta)}{\dfrac{\partial}{\partial x}F(x \mid \theta)} = \frac{p(\theta)}{r(x)}. \tag{21.100} $$

(21.100) shows that the ratio on its left-hand side must be a product of a function of $\theta$
and a function of $x$. We rewrite it

$$ \frac{1}{r(x)}\frac{\partial F}{\partial x} + \frac{1}{p(\theta)}\frac{\partial F}{\partial\theta} = 0. \tag{21.101} $$

For given $p(\theta)$ and $r(x)$, we solve (21.101) for $F$. The only non-constant solution is

$$ F = G\{R(x)-P(\theta)\}, \tag{21.102} $$

where $G$ is an arbitrary function and $R$, $P$ are respectively the integrals of $r$, $p$ with
respect to their arguments. If we write $u = R(x)$, $\tau = P(\theta)$, (21.102) becomes

$$ F = G\{u-\tau\}, \tag{21.103} $$

so that $\tau$ is a location parameter for $u$. Conversely, if (21.103) holds, (21.100) is
satisfied with $u$ and $\tau$ for $x$ and $\theta$ and $p(\tau)$ a uniform distribution. Thus (21.103) is a
necessary and sufficient condition for (21.100) to hold, i.e. for the fiducial distribution
to be equivalent to some Bayes posterior distribution.
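
The product condition can be explored numerically for particular distribution functions. The following sketch is illustrative and rests on stated assumptions: it approximates the ratio $-(\partial F/\partial\theta)/(\partial F/\partial x)$ by central differences and applies a cross-ratio test on a small grid, which is necessary (though, on a finite grid, not sufficient) for the ratio to be a product of a function of $x$ and a function of $\theta$. The three distribution functions and all names are our own choices; the last is the distribution function corresponding to the frequency function of Exercise 21.11.

```python
import math


def bayes_ratio(F, x, theta, eps=1e-5):
    """Central-difference value of -(dF/dtheta) / (dF/dx)."""
    dF_dtheta = (F(x, theta + eps) - F(x, theta - eps)) / (2.0 * eps)
    dF_dx = (F(x + eps, theta) - F(x - eps, theta)) / (2.0 * eps)
    return -dF_dtheta / dF_dx


def is_product_form(F, xs, thetas, rel_tol=1e-6):
    """Cross-ratio test: r(x1,t1) r(x2,t2) == r(x1,t2) r(x2,t1) on the grid."""
    for x1 in xs:
        for x2 in xs:
            for t1 in thetas:
                for t2 in thetas:
                    lhs = bayes_ratio(F, x1, t1) * bayes_ratio(F, x2, t2)
                    rhs = bayes_ratio(F, x1, t2) * bayes_ratio(F, x2, t1)
                    if abs(lhs - rhs) > rel_tol * max(abs(lhs), abs(rhs)):
                        return False
    return True


def normal_location(x, theta):
    """Normal distribution function with mean theta and unit variance."""
    return 0.5 * (1.0 + math.erf((x - theta) / math.sqrt(2.0)))


def exponential_scale(x, theta):
    """Exponential distribution function with scale theta."""
    return 1.0 - math.exp(-x / theta)


def lindley_example(x, theta):
    """Distribution function for f(x|theta) = theta^2 (x+1) exp(-x theta) / (theta+1)."""
    return 1.0 - math.exp(-x * theta) * (1.0 + x * theta / (theta + 1.0))


if __name__ == "__main__":
    grid = (0.5, 2.0)
    for name, F in (("normal location", normal_location),
                    ("exponential scale", exponential_scale),
                    ("Exercise 21.11", lindley_example)):
        print(name, is_product_form(F, grid, grid))
```

For the scale family the ratio is $x/\theta$, which corresponds to the prior $d\theta/\theta$ of 21.28; for the third distribution the test fails, so no prior distribution reproduces the fiducial distribution.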


21.42 Now consider the situation where we have two independent samples,
summarized by sufficient statistics $x$, $y$, from which to make an inference about $\theta$. We
can do this in two ways:

(a) we may consider the combined evidence of the two samples simultaneously,
and derive the fiducial distribution $\phi_{x,y}(\theta)$;

(b) we may derive the fiducial distribution $\phi_x(\theta)$ from the first sample alone, and
use this as the prior distribution in a Bayesian argument on the second sample, to
produce a posterior distribution $\pi_{x,y}(\theta)$.

Now if the fiducial argument is consistent with Bayesian arguments, (a) and (b)
are logically equivalent and we should have $\phi_{x,y}(\theta) = \pi_{x,y}(\theta)$.

Take the simplest case, where $x$ and $y$ have the same distribution. Since it admits
a single sufficient statistic for $\theta$, the parent frequency function is of the form (17.83), from
which we may assume (cf. Exercise 17.14) that the distribution of $x$ itself is of form

$$ f(x \mid \theta) = f(x)\, g(\theta) \exp(x\theta), \tag{21.104} $$

and similarly for $y$ in the other sample. Moreover, in the combined samples, $x+y$ is
evidently sufficient for $\theta$, and thus the combined fiducial distribution $\phi_{x,y}(\theta)$ is a
function of $(x+y)$ and $\theta$ only. We now ask for the conditions under which $\pi_{x,y}(\theta)$ is also
a function of $(x+y)$ and $\theta$ only. Since by Bayes’ theorem

$$ \pi_{x,y}(\theta) = \frac{\phi_x(\theta)\, f(y \mid \theta)}{\displaystyle\int \phi_x(\theta)\, f(y \mid \theta)\, d\theta}, $$

if $\pi_{x,y}(\theta)$ is a function of $(x+y)$ and $\theta$ only, so also will be the ratio for two different
values of $\theta$

$$ \frac{\pi_{x,y}(\theta)}{\pi_{x,y}(\theta')} = \frac{\phi_x(\theta)\, f(y \mid \theta)}{\phi_x(\theta')\, f(y \mid \theta')}. \tag{21.105} $$

Thus (21.105) must be invariant under interchange of $x$ and $y$. Using (21.104) in
(21.105), we therefore have

$$ \frac{\phi_x(\theta)\, g(\theta)}{\phi_x(\theta')\, g(\theta')} \exp\{y(\theta-\theta')\} = \frac{\phi_y(\theta)\, g(\theta)}{\phi_y(\theta')\, g(\theta')} \exp\{x(\theta-\theta')\}, $$

so that

$$ \frac{\phi_x(\theta)\,\phi_y(\theta')}{\phi_x(\theta')\,\phi_y(\theta)} = \exp\{(x-y)(\theta-\theta')\} $$

or

$$ \phi_x(\theta) = \frac{\phi_x(\theta')\, e^{-x\theta'}\;\phi_y(\theta)\, e^{-y\theta}}{\phi_y(\theta')\, e^{-y\theta'}}\; e^{x\theta}, \tag{21.106} $$

and if we regard $\theta'$ and $y$ as constants, we may write (21.106) as

$$ \phi_x(\theta) = A(x)\, B(\theta)\, e^{x\theta}, \tag{21.107} $$

where $A$ and $B$ are arbitrary functions. Using (21.97), (21.104) and (21.107), we have

$$ \frac{-\dfrac{\partial}{\partial\theta}F(x \mid \theta)}{\dfrac{\partial}{\partial x}F(x \mid \theta)} = \frac{\phi_x(\theta)}{f(x \mid \theta)} = \frac{A(x)\, B(\theta)}{f(x)\, g(\theta)}. \tag{21.108} $$

But (21.108) is precisely the condition (21.100), for which we saw (21.103) to be
necessary and sufficient. Thus we can have $\phi_{x,y}(\theta) = \pi_{x,y}(\theta)$ if and only if $x$ and $\theta$ are
transformable to (21.103) with $\tau$ a location parameter for $u$, and $p(\tau)$ a uniform
distribution. Thus the fiducial argument is consistent with Bayes’ theorem if and only if the
problem is transformable into a location parameter problem, the prior distribution of
the parameter then being uniform. An example where this is not so is given as
Exercise 21.11.
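
The inconsistency in that example can be seen numerically. The sketch below is a minimal illustration, assuming the frequency function $f(x \mid \theta) = \theta^2(x+1)e^{-x\theta}/(\theta+1)$ and the single-sample and combined fiducial distributions stated in Exercise 21.11; the function names and the trial values of $x$, $y$ and $\theta$ are ours. It compares the shapes of the distributions through ratios at two values of $\theta$, which avoids computing normalizing constants.

```python
import math


def freq(x, theta):
    """Frequency function theta^2 (x+1) exp(-x theta) / (theta+1)."""
    return theta ** 2 * (x + 1.0) * math.exp(-x * theta) / (theta + 1.0)


def phi_single(x, theta):
    """Fiducial distribution of theta from a single sufficient statistic x."""
    bracket = 1.0 + (1.0 + theta) * (1.0 + x)
    return theta * x * math.exp(-theta * x) * bracket / (theta + 1.0) ** 2


def phi_combined(x, y, theta):
    """Fiducial distribution of theta from the combined samples, z = x + y."""
    z = x + y
    poly = (theta ** 3 * (2.0 * z ** 2 + 4.0 * z ** 3 / 3.0 + z ** 4 / 6.0)
            + theta ** 4 * (z ** 2 + z ** 3 + z ** 4 / 6.0))
    return math.exp(-z * theta) * poly / (theta + 1.0) ** 3


def posterior_unnormalised(x, y, theta):
    """Bayes posterior from y, using the fiducial distribution from x as prior."""
    return phi_single(x, theta) * freq(y, theta)


def shape_ratio(g, theta_a=0.5, theta_b=2.0):
    """Ratio g(theta_a)/g(theta_b); equal densities must give equal ratios."""
    return g(theta_a) / g(theta_b)


if __name__ == "__main__":
    x, y = 1.0, 2.0
    print(shape_ratio(lambda t: phi_combined(x, y, t)))            # combined fiducial
    print(shape_ratio(lambda t: posterior_unnormalised(x, y, t)))  # prior from x, data y
    print(shape_ratio(lambda t: posterior_unnormalised(y, x, t)))  # prior from y, data x
```

The three ratios differ, so the combined fiducial distribution, the posterior built from $x$ then $y$, and the posterior built from $y$ then $x$ are three different distributions.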

Lindley goes on to show that in the exponential family of distributions (17.83), 
the normal and the Gamma distributions are the only ones obeying the condition of 
transformability to (21.103): this explains the identity of the results obtained by 
fiducial and Bayesian methods in these cases (cf. Example 21.3). Sprott (1960, 1961) 
shows that these remain the only such distributions if x and y are differently distributed. 

Welch and Peers (1963), Welch (1965), and Peers (1965) examine the problem of
correspondence of Bayesian and confidence intervals with special reference to
asymptotic solutions. Thatcher (1964) examines this correspondence for binomial predictions.
Geisser and Cornfield (1963) and Fraser (1964) display further difficulties with fiducial
distributions in the multivariate case. See also the I.S.I. Symposium cited at the
end of 21.38.

Fraser (1962) proposes a modification of the fiducial method which extends the 
range of its consistency with Bayesian methods. 


21.43 Still another objection to fiducial theory is one which has already been
mentioned in respect of the Bayes approach. It abandons a strict frequency approach 
to the problem of interval estimation. It is possible, indeed, as Barnard (1950) has 
shown, to justify the Fisher-Behrens solution of the two-means problem from a differ- 
ent frequency standpoint, but as he himself goes on to argue, the idea of a fixed “ refer- 
ence set,” in terms of which frequencies are to be interpreted, is really foreign to the 
fiducial approach. And it is at this point that the statistician must be left to choose 
between confidence intervals, which make precise frequency-interpretable statements 
which may on exceptional occasions be trivial, and the other methods, which forgo 
frequency interpretations in the interests of what are, perhaps intuitively, felt to be 
more relevant inferences.


EXERCISES 


21.1 If $\bar x$ is the mean of a sample of $n$ values from

$$ dF = \frac{1}{\sigma\sqrt{(2\pi)}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\} dx, $$

$s^2$ is equal to $\sum(x-\bar x)^2/(n-1)$, and $x$ is a further independent sample value, show that

$$ t = \frac{x-\bar x}{s}\sqrt{\frac{n}{n+1}} $$

is distributed in “Student’s” form with $n-1$ d.f. Hence show that fiducial limits for
$x$ are

$$ \bar x \pm s\, t_1 \sqrt{\frac{n+1}{n}}, $$

where $t_1$ is chosen so that the integral of “Student’s” form between $-t_1$ and $t_1$ is an
assigned probability $1-\alpha$.
(Fisher, 1935b. This gives an estimate of the next value when $n$ values have
already been chosen, and extends the idea of fiducial limits from parameters
to variates dependent on them.)


21.2 Show similarly that if a sample of $n_1$ values gives mean $\bar x_1$ and estimated variance
$s_1'^2$, the fiducial distribution of mean $\bar x_2$ and estimated variance $s_2'^2$ in a second sample of
$n_2$ is

$$ dF \propto \frac{s_1'^{\,n_1-1}\, s_2'^{\,n_2-2}\, d\bar x_2\, ds_2'}{\left\{(n_1-1)s_1'^2 + (n_2-1)s_2'^2 + (\bar x_1-\bar x_2)^2\,\dfrac{n_1 n_2}{n_1+n_2}\right\}^{\frac{1}{2}(n_1+n_2-1)}}. $$

Hence, allowing $n_2$ to tend to infinity, derive the simultaneous fiducial distribution
of $\mu$ and $\sigma$.
(Fisher, 1935b)
21.3 If the cumulative binomial distribution is given by

$$ G(f, \pi) = \sum_{j=f}^{n} \binom{n}{j} \pi^j (1-\pi)^{n-j}, $$

show that $f/n$ is sufficient for $\pi$ and that

$$ h_0(\pi)\, d\pi = \frac{\partial G(f, \pi)}{\partial\pi}\, d\pi = \frac{n!}{(f-1)!\,(n-f)!}\, \pi^{f-1}(1-\pi)^{n-f}\, d\pi $$

is an admissible fiducial distribution of $\pi$. Show that

$$ h_1(\pi)\, d\pi = \frac{\partial G(f+1, \pi)}{\partial\pi}\, d\pi = \frac{n!}{f!\,(n-f-1)!}\, \pi^{f}(1-\pi)^{n-f-1}\, d\pi $$

is also admissible.

Hence show how to determine $\pi_0$ from $h_0$ and $\pi_1$ from $h_1$, such that the fiducial interval
$\pi_0 \le \pi \le \pi_1$ has at least the associated probability $1-\alpha$.

(Stevens, 1950. The use of two fiducial distributions is necessitated by
discontinuity in the observed $f$. Compare 20.22 on the analogous difficulty in
confidence intervals.)


21.4 Let $l_{11}, l_{12}, \ldots, l_{1,n-1}$ be $(n-1)$ linear functions of the observations which
are orthogonal to one another and to $\bar x_1$, and let them have zero mean and variance $\sigma_1^2$.
Similarly define $l_{21}, l_{22}, \ldots, l_{2,n-1}$.

Then, in two samples of size $n$ from normal populations with equal means and
variances $\sigma_1^2$ and $\sigma_2^2$, the function

$$ \frac{(\bar x_1-\bar x_2)\, n^{1/2}}{\left\{\sum_j (l_{1j}+l_{2j})^2/(n-1)\right\}^{1/2}} $$

will be distributed as “Student’s” $t$ with $n-1$ degrees of freedom. Show how to set
confidence intervals to the difference of two means by this result, and show that the
solution (21.51) is a member of this class of statistics when $n_1 = n_2$.


21.5 Given two samples of $n_1$, $n_2$ members from normal populations with unequal
variances, show that by picking $n_1$ members at random from the $n_2$ (where $n_1 < n_2$)
and pairing them at random with the members of the first sample, confidence intervals
for the difference of means can be based on “Student’s” distribution independently
of the variance ratio in the populations. Show that this is equivalent to putting
$c_{ij} = 0$ $(i \neq j)$; $= 1$ $(i = j)$ in (21.36), and hence that this is an inefficient solution of the
two-means problem.


21.6 Use the method of 21.23 to show that the statistic z of (21.21) is distributed 
approximately in ‘‘ Student’s”’ form with degrees of freedom given by (21.64). 


21.7 From Fisher’s $F$ distribution (16.24), find the fiducial distribution of $\theta = \sigma_1^2/\sigma_2^2$,
and show that if we regard the “Student’s” distribution of the statistic (21.20) as the
joint fiducial distribution of $\delta$ and $\theta$, and integrate out $\theta$ over its fiducial distribution, we
arrive at the result of 21.26 for the distribution of $z$.

(Fisher, 1939)


21.8 Prove the statement in 21.16 to the effect that if $ax+by = z$, where $x$ and $y$
are independent random variables and $x$, $y$, $z$ are all $\chi^2$ variates, the constants $a = b = 1$.

(Scheffé, 1944)
21.9 Show that if we take the first two terms in the expansion on the right of (21.71),
(21.65) is, to order $1/\nu_i$, the approximation of (21.21) given in 21.24, i.e. a “Student’s”
distribution with degrees of freedom (21.64).


21.10 Show that for $n_1 = n_2 = n$, the conditional distribution of the statistic $z$ of
(21.21) for fixed $s_1/s_2$ is obtainable from the fact that


2 $ 


a\ * /o3 


is distributed like ‘‘ Student’s ’ t with 2(n—1) degrees of freedom. 
(Bartlett, 1936) 


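21.11 Show that if the distribution of a sufficient statistic $x$ is

$$ f(x \mid \theta) = \frac{\theta^2}{\theta+1}\,(x+1)\, e^{-x\theta}, \qquad x > 0,\ \theta > 0, $$

the fiducial distribution of $\theta$ for combined samples with sufficient statistics $x$, $y$, is

$$ \phi_{x,y}(\theta) = \frac{e^{-z\theta}}{(\theta+1)^3}\left[\theta^3\left(2z^2+\tfrac{4}{3}z^3+\tfrac{1}{6}z^4\right) + \theta^4\left(z^2+z^3+\tfrac{1}{6}z^4\right)\right] $$

(where $z = x+y$), while that for a single sample is

$$ \phi_x(\theta) = \frac{\theta x\, e^{-\theta x}}{(\theta+1)^2}\,[1+(1+\theta)(1+x)]. $$

(Note that the minus sign in (21.5) is unnecessary here, since $F(x \mid \theta)$ is an increasing
function of $\theta$.) Hence show that the Bayes posterior distribution from the second sample,
using $\phi_x(\theta)$ as prior distribution, is

$$ \pi_{x,y}(\theta) \propto \frac{e^{-z\theta}\,\theta^3}{(\theta+1)^3}\; x(1+y)\,[1+(1+\theta)(1+x)], $$

so that $\pi_{x,y}(\theta) \neq \phi_{x,y}(\theta)$. Note that $\pi_{x,y}(\theta) \neq \pi_{y,x}(\theta)$ also.
(Lindley, 1958a)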


CHAPTER 22 


TESTS OF HYPOTHESES: SIMPLE HYPOTHESES 


22.1 We now pass from the problems of estimating parameters to those of testing 
hypotheses concerning parameters. Instead of seeking the best (unique or interval) 
estimator of an unknown parameter, we shall now be concerned with deciding whether 
some pre-designated value is acceptable in the light of the observations. 

In a sense, the testing problem is logically prior to that of estimation. If, for 
example, we are examining the difference between the means of two normal popula- 
tions, our first question is whether the observations indicate that there is any true 
difference between the means. In other words, we have to compare the observed 
differences between the two samples with what might be expected on the hypothesis 
that there is no true difference at all, but only random sampling variation. If this 
hypothesis is not sustained, we proceed to the second step of estimating the magnitude 
of the difference between the population means. 

Quite obviously, the problems of testing hypotheses and of estimation are closely 
related, but it is nevertheless useful to preserve a distinction between them, if only 
for expository purposes. Many of the ideas expounded in this and the following 
chapters are due to Neyman and E. S. Pearson, whose remarkable series of papers 
(1928, 1933a, b, 1936a, b, 1938) is fundamental. See also the monograph by Lehmann 
(1959). 


22.2 The kind of hypothesis which we test in statistics is more restricted than the
general scientific hypothesis. It is a scientific hypothesis that every particle of matter
in the universe attracts every other particle, or that life exists on Mars; but these are
not hypotheses such as arise for testing from the statistical viewpoint. Statistical
hypotheses concern the behaviour of observable random variables. More precisely,
suppose that we have a set of random variables $x_1, \ldots, x_n$. As before, we may
represent them as the co-ordinates of a point ($x$, say) in the $n$-dimensional sample space,
one of whose axes corresponds to each variable. Since $x$ is a random variable, it has
a probability distribution, and if we select any region, say $w$, in the sample space $W$,
we may (at least in principle) calculate the probability that the sample point $x$ falls
in $w$, say $P(x \in w)$. We shall say that any hypothesis concerning $P(x \in w)$ is a statistical
hypothesis. In other words, any hypothesis concerning the behaviour of observable
random variables is a statistical hypothesis.

For example, the hypothesis (a) that a normal distribution has a specified mean 
and variance is statistical; so is the hypothesis (b) that it has a given mean but un- 
specified variance ; so is the hypothesis (c) that a distribution is of normal form, both 
mean and variance remaining unspecified ; and so, finally, is the hypothesis (d) that 
two unspecified continuous distributions are identical. Each of these four examples
implies certain properties of the sample space. Each of them is therefore translatable
into statements concerning the sample space, which may be tested by comparison with
observation.


Parametric and non-parametric hypotheses 


22.3 It will have been noticed that in the examples (a) and (b) in the last paragraph, 
the distribution underlying the observations was taken to be of a certain form (the 
normal) and the hypothesis was concerned entirely with the value of one or both of its
parameters. Such a hypothesis, for obvious reasons, is called parametric. 

Hypothesis (c) was of a different nature. It may be expressed in an alternative 
way, since it is equivalent to the hypothesis that the distribution has all cumulants 
finite, and all cumulants above the second equal to zero (cf. Example 3.11). Now the 
term “ parameter ” is often used to denote a cumulant or moment of the population, 
in order to distinguish it from the corresponding sample quantity. This is an
understandable, but rather loose, usage of the term. The normal distribution
$$dF(x) = (2\pi)^{-1/2} \exp\left\{-\tfrac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right\} \frac{dx}{\sigma}$$
has just two parameters, $\mu$ and $\sigma$. (Sometimes it is more convenient to regard $\mu$ and
$\sigma^2$ as the parameters, this being a matter of convention. We cannot affect the number
of parameters by minor considerations of this kind.) We know that the mean of the
distribution is equal to $\mu$, and the variance to $\sigma^2$, but the mean and variance are no
more parameters of the distribution than are, say, the median (also equal to $\mu$), the
mean deviation about the mean ($= \sigma(2/\pi)^{1/2}$), or any other of the infinite set of constants,
including all the moments and cumulants, which we may be interested in. By
“parameters,” then, we refer to a finite number of constants appearing in the specification
of the probability distribution of our random variable.

With this understanding, hypothesis (c), and also (d), of 22.2 are non-parametric 
hypotheses. We shall be discussing non-parametric hypotheses at length in Chapters 
30 onwards, but most of the theoretical discussion in this and the next chapter is equally 
applicable to the parametric and the non-parametric case. However, our particularized 
discussions will mostly be of parametric hypotheses. 


Simple and composite hypotheses 

22.4 There is a distinction between the hypotheses (a) and (b) in 22.2. In (a), 
the values of all the parameters of the distribution were specified by the hypothesis ; 
in (b) only a subset of the parameters was specified by the hypothesis. This distinction 
is important for the theory. To formulate it generally, if we have a distribution
depending upon $l$ parameters, and a hypothesis specifies unique values for $k$ of these
parameters, we call the hypothesis simple if $k = l$ and we call it composite if $k < l$. In
geometrical terms, we can represent the possible values of the parameters as a region
in a space of $l$ dimensions, one for each parameter. If the hypothesis considered
selects a unique point in this parameter space, it is a simple hypothesis; if the hypothesis
selects a sub-region of the parameter space which contains more than one point, it is
composite.

$l - k$ is known as the number of degrees of freedom of the hypothesis, and $k$ as the
number of constraints imposed by the hypothesis. This terminology is obviously
related to the geometrical picture in the last paragraph.


Critical regions and alternative hypotheses 


22.5 To test any hypothesis on the basis of a (random) sample of observations, we 
must divide the sample space (i.e. all possible sets of observations) into two regions. 
If the observed sample point $x$ falls into one of these regions, say $w$, we shall reject
the hypothesis; if $x$ falls into the complementary region, $W - w$, we shall accept the
hypothesis. $w$ is known as the critical region of the test, and $W - w$ is called the
acceptance region.

It is necessary to make it clear at the outset that the rather peremptory terms 
“reject” and “accept,” used of a hypothesis under test in the last paragraph, are
now conventional usage, to which we shall adhere, and are not intended to imply that
any hypothesis is ever finally accepted or rejected in science. If the reader cannot
overcome his philosophical dislike of these admittedly inapposite expressions, he will
perhaps agree to regard them as code words, “reject” standing for “decide that the
observations are unfavourable to” and “accept” for the opposite. We are concerned
to investigate procedures which make such decisions with calculable probabilities of 
error, in a sense to be explained. 


22.6 Now if we know the probability distribution of the observations under the
hypothesis being tested, which we shall call $H_0$, we can determine $w$ so that, given
$H_0$, the probability of rejecting $H_0$ (i.e. the probability that $x$ falls in $w$) is equal to a
pre-assigned value $\alpha$, i.e.
$$\operatorname{Prob}\{x \in w \mid H_0\} = \alpha. \tag{22.1}$$
If we are dealing with a discontinuous distribution, it may not be possible to satisfy
(22.1) for every $\alpha$ in the interval (0, 1). The value $\alpha$ is called the size of the test.(*)
For the moment, we shall regard $\alpha$ as determined in some way. We shall discuss the
choice of $\alpha$ later.

Evidently, we can in general find many, and often even an infinity, of sub-regions
$w$ of the sample space, all obeying (22.1). Which of them should we prefer to the
others? This is the problem of the theory of testing hypotheses. To put it in
everyday terms, which sets of observations are we to regard as favouring, and which as
disfavouring, a given hypothesis?


22.7 Once the question is put in this way, we are directed to the heart of the
problem. For it is of no use whatever to know merely what properties a critical region
will have when $H_0$ holds. What happens when some other hypothesis holds? In
other words, we cannot say whether a given body of observations favours a given
hypothesis unless we know to what alternative(s) this hypothesis is being compared.
It is perfectly possible for a sample of observations to be a rather “unlikely” one if
the original hypothesis were true; but it may be much more “unlikely” on another
hypothesis. If the situation is such that we are forced to choose one hypothesis or
the other, we shall obviously choose the first, notwithstanding the “unlikeliness” of
the observations. The problem of testing a hypothesis is essentially one of choice
between it and some other or others. It follows immediately that whether or not we
accept the original hypothesis depends crucially upon the alternatives against which
it is being tested.


(*) The hypothesis under test is often called “the null hypothesis,” and the size of the test
“the level of significance.” We shall not use these terms, since the words “null” and
“significance” can be misleading.


The power of a test 

22.8 The discussion of 22.7 leads us to the recognition that a critical region (or,
synonymously, a test) must be judged by its properties both when the hypothesis tested
is true and when it is false. Thus we may say that the errors made in testing a statistical
hypothesis are of two types:

(I) We may wrongly reject it, when it is true;

(II) We may wrongly accept it, when it is false.


These are known as Type I and Type II errors respectively. The probability of
a Type I error is equal to the size of the critical region used, $\alpha$. The probability of a
Type II error is, of course, a function of the alternative hypothesis (say, $H_1$)
considered, and is usually denoted by $\beta$. Thus
$$\operatorname{Prob}\{x \in W - w \mid H_1\} = \beta$$
or
$$\operatorname{Prob}\{x \in w \mid H_1\} = 1 - \beta. \tag{22.2}$$
This complementary probability, $1 - \beta$, is called the power of the test of the hypothesis
$H_0$ against the alternative hypothesis $H_1$. The specification of $H_1$ in the last sentence
is essential, since power is a function of $H_1$.


Example 22.1 


Consider the problem of testing a hypothetical value for the mean of a normal
distribution with unit variance. Formally, in
$$dF(x) = (2\pi)^{-1/2} \exp\{-\tfrac{1}{2}(x-\mu)^2\}\,dx, \qquad -\infty < x < \infty,$$
we test the hypothesis
$$H_0: \mu = \mu_0.$$
This is a simple hypothesis, since it specifies $F(x)$ completely. The alternative
hypothesis will also be taken as the simple
$$H_1: \mu = \mu_1 > \mu_0.$$

Thus, essentially, we are to choose between a smaller given value ($\mu_0$) and a larger
($\mu_1$) for the mean of our distribution.

We may represent the situation diagrammatically for a sample of $n = 2$ observations.
In Fig. 22.1 we show the scatters of sample points which would arise, the
lower cluster being that arising when $H_0$ is true, and the higher when $H_1$ is true.

In this case, of course, the sampling distributions are continuous, but the dots
indicate roughly the condensations of sample densities around the true means.

Fig. 22.1: Critical regions for $n = 2$ (see text)


To choose a critical region, we need, in accordance with (22.1), to choose a region
in the plane containing a proportion $\alpha$ of the distribution on $H_0$. One such region
is represented by the area above the line PQ, which is perpendicular to the line AB
connecting the hypothetical means. (A is the point $(\mu_0, \mu_0)$, and B the point $(\mu_1, \mu_1)$.)
Another possible critical region of size $\alpha$ is the region CAD.

We see at once from the circular symmetry of the clusters that the first of these
critical regions contains a very much larger proportion of the $H_1$ cluster than does the
CAD region. The first region will reject $H_0$ rightly, when $H_1$ is true, in a higher
proportion of cases than will the second region. Consequently, its value of $1 - \beta$
in (22.2), or in other words its power, will be the greater.


22.9 Example 22.1 directs us to an obvious criterion for choosing among critical
regions, all satisfying (22.1). We seek a critical region $w$ such that its power, defined
at (22.2), is as large as possible. Then, in addition to having controlled the probability
of Type I errors at $\alpha$, we shall have minimized the probability of a Type II error, $\beta$.
This is the fundamental idea, first expressed explicitly by J. Neyman and E. S. Pearson,
which underlies the theory of this and following chapters.

A critical region, whose power is no smaller than that of any other region of the
same size for testing a hypothesis $H_0$ against the alternative $H_1$, is called a best critical
region (abbreviated BCR), and a test based on a BCR is called a most powerful
(abbreviated MP) test.


Testing a simple $H_0$ against a simple $H_1$

22.10 If we are testing a simple hypothesis against a simple alternative hypothesis,
i.e. choosing between two completely specified distributions, the problem of finding a
BCR of size $\alpha$ is particularly straightforward. Its solution is given by a lemma due
to Neyman and Pearson (1933b), which we now prove.

As in earlier chapters, we write $L(x \mid H_i)$ for the Likelihood Function given the
hypothesis $H_i$ $(i = 0, 1)$, and write a single integral to represent $n$-fold integration
in the sample space. Our problem is to maximize, for choice of $w$, the integral form
of (22.2),
$$1 - \beta = \int_w L(x \mid H_1)\,dx, \tag{22.3}$$
subject to the condition (22.1), which we here write
$$\int_w L(x \mid H_0)\,dx = \alpha. \tag{22.4}$$
The critical region $w$ should obviously include all points $x$ for which $L(x \mid H_0) = 0$,
$L(x \mid H_1) > 0$; these points contribute nothing to the integral in (22.4). For the other
points in $w$, we may rewrite (22.3) as
$$1 - \beta = \int_w \frac{L(x \mid H_1)}{L(x \mid H_0)}\,L(x \mid H_0)\,dx, \tag{22.5}$$
so that we have to choose $w$ to maximize the expectation, when $H_0$ holds, of
$L(x \mid H_1)/L(x \mid H_0)$ in $w$. Clearly this will be done if and only if $w$ consists of that
fraction $\alpha$ of the sample space containing the largest values of $L(x \mid H_1)/L(x \mid H_0)$.
Thus the BCR consists of the points in $W$ satisfying
$$\frac{L(x \mid H_0)}{L(x \mid H_1)} \le k_\alpha, \tag{22.6}$$
$k_\alpha$ being such that these points have total probability $\alpha$ when $H_0$ holds. To any
constant $k_\alpha$ in (22.6) there corresponds a value $\alpha$ for the size (22.4). If the $x$’s are
continuously distributed, we can also find a $k_\alpha$ for any $\alpha$.


22.11 If the distribution of the $x$’s is not continuous, we may effectively render
it so by a randomization device (cf. 20.22). In this case,
$$\frac{L(x \mid H_0)}{L(x \mid H_1)} = k_\alpha \tag{22.7}$$
with some non-zero probability $p$, while in general, owing to discreteness, we can only
choose $k_\alpha$ in (22.6) to make the size of the test equal to $\alpha - q$ $(0 < q < p)$. To convert
the test into one of exact size $\alpha$, we simply arrange that, whenever (22.7) holds,
we use a random device (e.g. a table of random sampling numbers) so that with
probability $q/p$ we reject $H_0$, while with probability $1 - (q/p)$ we accept $H_0$. The
overall probability of rejection will then be $(\alpha - q) + p \cdot q/p = \alpha$, as required, whatever
the value of $\alpha$ desired. In this case, the BCR is clearly not unique, being subject
to random sampling fluctuation.
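
The randomization device can be sketched numerically. The following Python fragment is only an illustration under assumptions that are not in the text: it takes a binomial count $X$ in $n$ trials, tests $p = p_0$ against a larger value, and uses the fact that the likelihood ratio then decreases in $X$, so that the BCR is an upper tail. The function names are hypothetical; the variables `p` and `q` play the roles they have in 22.11.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n binomial trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def randomized_upper_tail(n, p0, alpha):
    """Return (c, p, q) for an upper-tail test of exact size alpha.

    Rejecting when X > c has size alpha - q; X == c has probability p
    under H0; rejecting with probability q/p when X == c makes the
    overall size exactly alpha.  (q may be 0 if alpha is attainable.)
    """
    tail = 0.0  # P(X > c | H0), accumulated from the top down
    for c in range(n, -1, -1):
        p = binom_pmf(c, n, p0)
        if tail + p > alpha:
            q = alpha - tail
            return c, p, q
        tail += p
    raise ValueError("alpha must be less than 1")

# Assumed illustrative values: n = 10 trials, p0 = 0.5, alpha = 0.05.
c, p, q = randomized_upper_tail(10, 0.5, 0.05)
size = (0.05 - q) + p * (q / p)  # overall rejection probability, as in 22.11
print(c, q / p, size)
```

With these assumed values the boundary point is $X = 8$, and the random device rejects there with probability $q/p$ of about 0.893.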


Example 22.2 
Consider again the normal distribution of Example 22.1,
$$dF(x) = (2\pi)^{-1/2}\exp[-\tfrac{1}{2}(x-\mu)^2]\,dx, \qquad -\infty < x < \infty, \tag{22.8}$$
where we are now to test $H_0: \mu = \mu_0$ against the alternative $H_1: \mu = \mu_1\ (\neq \mu_0)$.
We have
$$L(x \mid H_i) = (2\pi)^{-n/2}\exp\left\{-\tfrac{1}{2}\sum_{j=1}^{n}(x_j-\mu_i)^2\right\}, \qquad i = 0, 1,$$
$$= (2\pi)^{-n/2}\exp\left[-\frac{n}{2}\{s^2 + (\bar{x}-\mu_i)^2\}\right], \tag{22.9}$$
where $\bar{x}$, $s^2$ are the sample mean and variance respectively. Thus, for the BCR, we
have from (22.6)
$$\frac{L(x \mid H_0)}{L(x \mid H_1)} = \exp\left[-\frac{n}{2}\{(\bar{x}-\mu_0)^2 - (\bar{x}-\mu_1)^2\}\right]
= \exp\left[\frac{n}{2}\{(\mu_0-\mu_1)2\bar{x} + (\mu_1^2-\mu_0^2)\}\right] \le k_\alpha, \tag{22.10}$$
or
$$(\mu_0-\mu_1)\bar{x} \le \tfrac{1}{2}(\mu_0^2-\mu_1^2) + \frac{1}{n}\log k_\alpha. \tag{22.11}$$
Thus, given $\mu_0$, $\mu_1$ and $\alpha$, the BCR is determined by the value of the sample mean $\bar{x}$
alone. This is what we might have expected from the fact (cf. Examples 17.6, 17.15)
that $\bar{x}$ is a MVB sufficient statistic for $\mu$. Further, from (22.11), we see that if $\mu_0 > \mu_1$
the BCR is
$$\bar{x} \le \tfrac{1}{2}(\mu_0+\mu_1) + \log k_\alpha/\{n(\mu_0-\mu_1)\}, \tag{22.12}$$
while if $\mu_0 < \mu_1$ it is
$$\bar{x} \ge \tfrac{1}{2}(\mu_0+\mu_1) - \log k_\alpha/\{n(\mu_1-\mu_0)\}, \tag{22.13}$$
which is again intuitively reasonable: in testing the hypothetical value $\mu_0$ against a
smaller value $\mu_1$, we reject $\mu_0$ if the sample mean falls below a certain value, which
depends on $\alpha$, the size of the test; in testing $\mu_0$ against a larger value $\mu_1$, we reject $\mu_0$
if the sample mean exceeds a certain value.
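
The reduction of the likelihood ratio to a function of the sample mean alone can be checked numerically. The sketch below is an illustration only: the function names and the two small samples are hypothetical, chosen so that both samples share the same mean but differ in spread. It compares the ratio computed directly from the unit-variance normal likelihoods with the exponent in (22.10).

```python
from math import log, pi

def log_likelihood(xs, mu):
    """Log-likelihood of a sample from a normal distribution, unit variance."""
    n = len(xs)
    return -0.5 * n * log(2 * pi) - 0.5 * sum((x - mu) ** 2 for x in xs)

def log_ratio_direct(xs, mu0, mu1):
    """log{L(x|H0)/L(x|H1)} computed from the full sample."""
    return log_likelihood(xs, mu0) - log_likelihood(xs, mu1)

def log_ratio_from_mean(xbar, n, mu0, mu1):
    """Exponent of (22.10): depends on the sample only through its mean."""
    return 0.5 * n * (2 * xbar * (mu0 - mu1) + mu1**2 - mu0**2)

# Two hypothetical samples with the same mean (2.5) but different spread.
sample_a = [1.0, 2.0, 3.0, 4.0]
sample_b = [2.5, 2.5, 2.5, 2.5]
mu0, mu1 = 2.0, 2.6  # assumed hypothetical values

direct_a = log_ratio_direct(sample_a, mu0, mu1)
direct_b = log_ratio_direct(sample_b, mu0, mu1)
via_mean = log_ratio_from_mean(2.5, 4, mu0, mu1)
print(direct_a, direct_b, via_mean)
```

The three printed values agree to rounding error, so any region defined by the ratio is a region defined by $\bar{x}$.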


22.12 A feature of Example 22.2 which is worth remarking, since it occurs in a 
number of problems, is that the BCR turns out to be determined by a single statistic, 
rather than by the whole configuration of sample values. This simplification permits 
us to carry on our discussion entirely in terms of the sampling distribution of that 
statistic, called a “test statistic,” and to avoid the complexities of n-dimensional 
distributions. 


Example 22.3


In Example 22.2, we know (cf. Example 11.12) that whatever the value of $\mu$, $\bar{x}$ is
itself exactly normally distributed with mean $\mu$ and variance $1/n$. Thus, to obtain
a test of size $\alpha$ for testing $\mu_0$ against $\mu_1 > \mu_0$, we determine $\bar{x}_\alpha$ so that
$$\int_{\bar{x}_\alpha}^{\infty}\left(\frac{n}{2\pi}\right)^{1/2}\exp\left\{-\frac{n}{2}(\bar{x}-\mu_0)^2\right\}d\bar{x} = \alpha.$$
Writing
$$G(x) = \int_{-\infty}^{x}(2\pi)^{-1/2}\exp(-\tfrac{1}{2}y^2)\,dy, \tag{22.14}$$
we have, for $\mu_1 > \mu_0$,
$$\bar{x}_\alpha = \mu_0 + d_\alpha/n^{1/2}, \tag{22.15}$$
where
$$G(-d_\alpha) = \alpha. \tag{22.16}$$

For example, with $\mu_0 = 2$, $n = 25$ and $\alpha = 0.05$, we find, from a table of the
normal integral,
$$d_{0.05} = 1.6449,$$
so that, from (22.15),
$$\bar{x}_\alpha = 2 + 1.6449/5 = 2.3290.$$

In this simple example, the power of the test may be written down explicitly. It is
$$\int_{\bar{x}_\alpha}^{\infty}\left(\frac{n}{2\pi}\right)^{1/2}\exp\left\{-\frac{n}{2}(\bar{x}-\mu_1)^2\right\}d\bar{x} = 1 - \beta. \tag{22.17}$$
Using (22.15), we may standardize this integral to
$$1 - G\{n^{1/2}(\mu_0-\mu_1) + d_\alpha\} = G\{n^{1/2}(\mu_1-\mu_0) - d_\alpha\}, \tag{22.18}$$
since $G(x) = 1 - G(-x)$ by symmetry. From (22.18) it is clear that the power is a
monotone increasing function both of $n$, the sample size, and of $(\mu_1 - \mu_0)$, the difference
between the hypothetical values between which the test has to choose.
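
The arithmetic of this example can be reproduced with the standard normal distribution in Python's `statistics` module. This is a minimal sketch: the function names are hypothetical, and the alternative values passed to `power` are assumed for illustration only.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # G(x) of (22.14) is STD_NORMAL.cdf(x)

def normal_deviate(alpha):
    """d_alpha of (22.16): G(-d_alpha) = alpha."""
    return -STD_NORMAL.inv_cdf(alpha)

def critical_value(mu0, n, alpha):
    """Boundary of the BCR for mu1 > mu0, as in (22.15)."""
    return mu0 + normal_deviate(alpha) / sqrt(n)

def power(mu0, mu1, n, alpha):
    """Power against mu1 > mu0, as in (22.18)."""
    return STD_NORMAL.cdf(sqrt(n) * (mu1 - mu0) - normal_deviate(alpha))

# Values from the text: mu0 = 2, n = 25, alpha = 0.05.
print(round(normal_deviate(0.05), 4))           # 1.6449
print(round(critical_value(2.0, 25, 0.05), 4))  # 2.329

# Assumed alternatives, to show the monotonicity noted in the text.
print(power(2.0, 2.2, 25, 0.05), power(2.0, 2.5, 25, 0.05), power(2.0, 2.2, 100, 0.05))
```

At $\mu_1 = \mu_0$ the power function reduces to the size $\alpha$, which gives a quick check on any implementation.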


Example 22.4 
As a contrast, consider the Cauchy distribution
$$dF(x) = \frac{dx}{\pi\{1+(x-\theta)^2\}}, \qquad -\infty < x < \infty,$$
and suppose that we wish to test
$$H_0: \theta = 0$$
against
$$H_1: \theta = 1.$$
For simplicity, we shall confine ourselves to the case $n = 1$. According to (22.6), the
BCR is given by
$$\frac{L(x \mid H_0)}{L(x \mid H_1)} = \frac{1+(x-1)^2}{1+x^2} \le k_\alpha.$$
This is equivalent to
$$x^2(k_\alpha-1) + 2x + (k_\alpha-2) \ge 0. \tag{22.19}$$
The form of the BCR thus defined depends upon the value of $\alpha$ chosen. If, to
take a simple case, we had $k_\alpha = 1$, (22.19) reduces to $x \ge \tfrac{1}{2}$, so that we should reject
$\theta = 0$ in favour of $\theta = 1$ whenever the observed $x$ was closer to 1 than to 0. If, on the
other hand, we take $k_\alpha = 0.5$, (22.19) becomes
$$x^2 - 4x + 3 \le 0 \quad \text{or} \quad (x-2)^2 \le 1,$$
which holds when $1 \le x \le 3$. This is the critical region.

Since the Cauchy distribution is a “Student’s” distribution with 1 degree of freedom,
and accordingly $F(x) = \tfrac{1}{2} + \frac{1}{\pi}\arctan x$, we may calculate the size of each of the two
tests above.
For $k_\alpha = 1$, the size is
$$\operatorname{prob}(x \ge \tfrac{1}{2}) = 0.352,$$
while for $k_\alpha = 0.5$ the size is
$$\operatorname{prob}(1 \le x \le 3) = 0.148.$$
The same distribution function may also be used to determine the powers of these tests. We leave
this to the reader as Exercise 22.4 at the end of this chapter.
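
The two sizes follow directly from the arc tan form of the distribution function. A minimal Python sketch (the function names are hypothetical) is given below; the same functions, evaluated with `theta = 1.0`, give the powers asked for in Exercise 22.4.

```python
from math import atan, pi

def cauchy_cdf(x, theta=0.0):
    """Distribution function of the Cauchy distribution centred at theta."""
    return 0.5 + atan(x - theta) / pi

def prob_interval(lo, hi, theta):
    """Probability that lo <= x <= hi when the location is theta."""
    return cauchy_cdf(hi, theta) - cauchy_cdf(lo, theta)

# Sizes are evaluated under H0 (theta = 0).
size_k1 = 1.0 - cauchy_cdf(0.5, 0.0)      # region x >= 1/2   (k_alpha = 1)
size_k05 = prob_interval(1.0, 3.0, 0.0)   # region 1 <= x <= 3 (k_alpha = 0.5)
print(round(size_k1, 3), round(size_k05, 3))  # 0.352 0.148

# Powers are the same probabilities evaluated under H1 (theta = 1).
power_k1 = 1.0 - cauchy_cdf(0.5, 1.0)
power_k05 = prob_interval(1.0, 3.0, 1.0)
```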


22.13 The examples we have given so far of the use of the Neyman-Pearson lemma
have related to the testing of a parameter value for some given form of distribution.
But, as will be seen on inspection of the proof in 22.10-22.11, (22.6) gives the BCR
for any test of a simple hypothesis against a simple alternative. For instance, we might
be concerned to test the form of a distribution with known location parameter, as in
the following example.


Example 22.5 
Suppose that we know that the mean of a distribution is equal to zero, but wish to
investigate its form. We wish to choose between the alternative forms
$$H_0: dF = (2\pi)^{-1/2}\exp(-\tfrac{1}{2}x^2)\,dx,$$
$$H_1: dF = \tfrac{1}{2}\exp(-|x|)\,dx.$$
For simplicity, we again take sample size $n = 1$.
Using (22.6), the BCR is given by
$$\frac{L(x \mid H_0)}{L(x \mid H_1)} = \left(\frac{2}{\pi}\right)^{1/2}\exp(|x| - \tfrac{1}{2}x^2) \le k_\alpha.$$
Thus we reject $H_0$ when
$$|x| - \tfrac{1}{2}x^2 \le \log\left\{k_\alpha\left(\frac{\pi}{2}\right)^{1/2}\right\} = c_\alpha.$$
The BCR therefore consists of extreme positive and negative values of the observation,
supplemented, if $k_\alpha > (2/\pi)^{1/2}$ (i.e. $c_\alpha > 0$), by values in the neighbourhood of $x = 0$.
The reader should verify this by drawing a diagram.
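
As an alternative to a diagram, the shape of this region can be worked out from the quadratic $|x| - \tfrac{1}{2}x^2 = c_\alpha$, whose roots in $|x|$ are $1 \pm (1 - 2c_\alpha)^{1/2}$. The Python sketch below is an illustration only; the function names and the trial values of $c_\alpha$ are assumptions, not part of the example.

```python
from math import sqrt
from statistics import NormalDist

def bcr_boundaries(c):
    """Boundaries of the region |x| - x**2/2 <= c.

    Returns (inner, outer): H0 is rejected if |x| >= outer, and also if
    |x| <= inner when inner is not None (which happens only for c >= 0).
    """
    if c >= 0.5:
        raise ValueError("for c >= 1/2 every x lies in the region")
    root = sqrt(1.0 - 2.0 * c)
    inner = 1.0 - root if c >= 0 else None
    return inner, 1.0 + root

def size_under_normal(c):
    """Probability of the region when H0 (standard normal) holds."""
    cdf = NormalDist().cdf
    inner, outer = bcr_boundaries(c)
    size = 2.0 * (1.0 - cdf(outer))     # the two extreme tails
    if inner is not None:
        size += 2.0 * cdf(inner) - 1.0  # the central part near x = 0
    return size

print(bcr_boundaries(-1.5))   # tails only: |x| >= 3
print(bcr_boundaries(0.375))  # tails |x| >= 1.5 plus centre |x| <= 0.5
print(size_under_normal(0.0))
```

For negative $c_\alpha$ only the tails appear; once $c_\alpha$ is positive a central interval about zero is added, in agreement with the statement above.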


BCR and sufficient statistics 
22.14 If both hypotheses being compared refer to the value of a parameter $\theta$,
and there is a sufficient statistic $t$ for $\theta$, it follows from the factorization of the Likelihood
Function at (17.68) that (22.6) becomes
$$\frac{L(x \mid \theta_0)}{L(x \mid \theta_1)} = \frac{g(t \mid \theta_0)}{g(t \mid \theta_1)} \le k_\alpha, \tag{22.20}$$
so that the BCR is a function of the value of the sufficient statistic $t$, as might be
expected. We have already encountered an instance of this in Example 22.2. (The
same result evidently holds if $\theta$ is a set of parameters for which $t$ is a jointly sufficient
set of statistics.) Exercise 22.13 shows that the ratio of likelihoods on the left of (22.20)
is itself a sufficient statistic, so that the BCR is a function of its value.

However, it will not always be the case that the BCR will, as in Example 22.2,
be of the form $t \ge a_\alpha$ or $t \le b_\alpha$: Example 22.4, in which the single observation $x$ is a
sufficient statistic for $\theta$, is a counter-example. Inspection of (22.20) makes it clear that
the BCR will be of this particularly simple form if $g(t \mid \theta_0)/g(t \mid \theta_1)$ is a non-decreasing
function of $t$ for $\theta_0 > \theta_1$. This will certainly be true if
$$\frac{\partial^2}{\partial\theta\,\partial t}\log g(t \mid \theta) \ge 0, \tag{22.21}$$
a condition which is satisfied by nearly all the distributions met with in statistics.
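
As a worked check of (22.21), offered as a sketch rather than as part of the original argument, the condition may be evaluated for the two cases already met: the normal mean of Example 22.2 (with $t = \bar{x}$, whose variance is $1/n$) and the single Cauchy observation of Example 22.4 (with $t = x$ and the shorthand $u = x - \theta$, introduced here for convenience).

```latex
% Normal mean: t = \bar{x}, normal with mean \theta and variance 1/n
\log g(t \mid \theta) = \text{const} - \tfrac{n}{2}(t-\theta)^2,
\qquad
\frac{\partial^2}{\partial\theta\,\partial t}\log g(t \mid \theta) = n > 0 .

% Cauchy, single observation: t = x, u = x - \theta
\log g(t \mid \theta) = -\log\pi - \log(1+u^2),
\qquad
\frac{\partial^2}{\partial\theta\,\partial t}\log g(t \mid \theta)
  = \frac{2(1-u^2)}{(1+u^2)^2} .
```

The first expression is positive everywhere, so the BCR is one-sided in $\bar{x}$; the second is negative whenever $|x - \theta| > 1$, which is consistent with the interval-shaped region found in Example 22.4.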


Example 22.6 
For the distribution
$$dF(x) = \begin{cases} \exp\{-(x-\theta)\}\,dx, & \theta \le x < \infty, \\ 0 & \text{elsewhere,} \end{cases}$$
the smallest sample observation $x_{(1)}$ is sufficient for $\theta$ (cf. Example 17.19). For a
sample of $n$ observations, we have, for testing $\theta_0$ against $\theta_1 > \theta_0$,
$$\frac{L(x \mid \theta_0)}{L(x \mid \theta_1)} = \begin{cases} \infty & \text{if } x_{(1)} < \theta_1, \\ \exp\{n(\theta_0-\theta_1)\} & \text{otherwise.} \end{cases}$$
Thus we require for a BCR
$$\exp\{n(\theta_0-\theta_1)\} \le k_\alpha. \tag{22.22}$$
Now the left-hand side of (22.22) does not depend on the observations at all, being a
constant, and (22.22) will therefore be satisfied by every critical region of size $\alpha$ with
$x_{(1)} \ge \theta_1$. Thus every such critical region is of equal power, and is therefore a BCR.
If we allow $\theta_1$ to be greater or less than $\theta_0$, we find
$$\frac{L(x \mid \theta_0)}{L(x \mid \theta_1)} = \begin{cases} \infty & \text{if } \theta_0 \le x_{(1)} < \theta_1, \\ \exp\{n(\theta_0-\theta_1)\} > 1 & \text{if } x_{(1)} \ge \theta_0 > \theta_1, \\ \exp\{n(\theta_0-\theta_1)\} < 1 & \text{if } x_{(1)} \ge \theta_1 > \theta_0, \\ 0 & \text{if } \theta_1 \le x_{(1)} < \theta_0. \end{cases}$$
Thus the BCR is given by
$$(x_{(1)} - \theta_0) \le 0, \qquad (x_{(1)} - \theta_0) \ge c_\alpha.$$
The first of these events has probability zero on $H_0$. The value of $c_\alpha$ is determined
to give probability $\alpha$ that the second event occurs when $H_0$ is true.


Estimating efficiency and power

22.15 The use of a statistic which is efficient in estimation (cf. 17.28-9) does
not imply that a more powerful test will be obtained than if a less efficient estimator
had been used for testing purposes. This result is due to Sundrum (1954).

Let $t_1$ and $t_2$ be two asymptotically normally distributed estimators of a parameter $\theta$,
and suppose that, at least asymptotically,
$$E(t_1) = E(t_2) = \theta,$$
$$\operatorname{var}(t_i \mid \theta = \theta_0) = \sigma_{i0}^2, \qquad \operatorname{var}(t_i \mid \theta = \theta_1) = \sigma_{i1}^2, \qquad i = 1, 2.$$
We now test $H_0: \theta = \theta_0$ against $H_1: \theta = \theta_1 > \theta_0$. Exactly as at (22.15) in Example
22.3, we have the critical regions, one for each test,
$$t_i \ge \theta_0 + d_\alpha\sigma_{i0}, \qquad i = 1, 2, \tag{22.23}$$
where $d_\alpha$ is the normal deviate defined by (22.14) and (22.16). The powers of the
tests are (generalizing (22.18) which dealt with a case where $\sigma_{i0} = \sigma_{i1}$)
$$1 - \beta_i = G\left\{\frac{(\theta_1-\theta_0) - d_\alpha\sigma_{i0}}{\sigma_{i1}}\right\}. \tag{22.24}$$
Since $G(x)$ is a monotone increasing function of its argument, $t_1$ will provide a more
powerful test than $t_2$ if and only if, from (22.24),
$$\frac{(\theta_1-\theta_0) - d_\alpha\sigma_{10}}{\sigma_{11}} > \frac{(\theta_1-\theta_0) - d_\alpha\sigma_{20}}{\sigma_{21}},$$
i.e., if (taking $\sigma_{21} > \sigma_{11}$)
$$\theta_1-\theta_0 > \frac{\sigma_{21}\sigma_{10} - \sigma_{11}\sigma_{20}}{\sigma_{21}-\sigma_{11}}\,d_\alpha. \tag{22.25}$$
If we put $E_i = \sigma_{2i}/\sigma_{1i}$ $(i = 0, 1)$, (22.25) becomes
$$\frac{\theta_1-\theta_0}{\sigma_{10}} > \frac{E_1-E_0}{E_1-1}\,d_\alpha. \tag{22.26}$$
$E_0$, $E_1$ are simply powers (usually square roots) of the estimating efficiency of $t_1$ relative
to $t_2$ when $H_0$ and $H_1$ respectively hold. Now if
$$E_0 = E_1 > 1, \tag{22.27}$$
the right-hand side of (22.25) is zero, and (22.26) always holds. Thus if the estimating
efficiency of $t_1$ exceeds that of $t_2$ by the same amount on both hypotheses, the more
efficient statistic $t_1$ always provides a more powerful test, whatever value $\alpha$ or $\theta_1 - \theta_0$
takes. But if
$$E_1 > E_0 \ge 1 \tag{22.28}$$
we can always find a test size $\alpha$ small enough for (22.26) to be falsified. Hence, the
less efficient estimator $t_2$ will provide a more powerful test if (22.28) holds, i.e. if its
relative efficiency is greater on $H_0$ than on $H_1$. Alternatively if $E_0 > E_1 > 1$, we can
find $\alpha$ large enough to falsify (22.26). If $E_i$ is continuous in $\theta$, $E_1 \to E_0$ as $\theta_1 \to \theta_0$,
so that (22.26) is not falsified in the immediate neighbourhood of $\theta_0$.

This result, though a restrictive one, is enough to show that the relation between
estimating efficiency and test power is rather loose. In Chapter 25 we shall again
consider this relationship when we discuss the measurement of test efficiency.


Example 22.7

In Examples 18.3 and 18.6 we saw that in estimating the parameter $\rho$ of a standardized
bivariate normal distribution, the ML estimator $\hat\rho$ is a root of a cubic equation, with
large-sample variance equal to $(1-\rho^2)^2/\{n(1+\rho^2)\}$, while the sample correlation
coefficient $r$ has large-sample variance $(1-\rho^2)^2/n$. Both estimators are consistent and
asymptotically normal, and the ML estimator is efficient. In the notation of 22.15,
$$E = (1+\rho^2)^{1/2}.$$
If we test $H_0: \rho = 0$ against $H_1: \rho = 0.1$, we have $E_0 = 1$, and (22.26) simplifies to
$$0.1\,n^{1/2} > d_\alpha. \tag{22.29}$$
If we choose $n$ to be, say, 400, so that the normal approximations are adequate, we
require
$$d_\alpha > 2$$
to falsify (22.29). This corresponds to $\alpha < 0.023$, so that for tests of size $< 0.023$, the
inefficient estimator $r$ has greater power asymptotically in this case than the efficient $\hat\rho$.
Since tests of size 0.01, 0.05 are quite commonly used, this is not merely a theoretical
example: it cannot be assumed in practice that “good” estimators are “good” test
statistics.
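
The crossover near $\alpha = 0.023$ can be seen by evaluating the asymptotic powers (22.24) for both statistics. The Python sketch below is an illustration under the large-sample variances quoted in this example; the function name is hypothetical, and the values $n = 400$ and $\rho = 0.1$ are those of the text.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def asymptotic_powers(alpha, n=400, rho1=0.1):
    """Asymptotic powers (22.24) of the tests based on the ML estimator
    and on the sample correlation r, for H0: rho = 0 against rho = rho1."""
    d = -STD_NORMAL.inv_cdf(alpha)                        # d_alpha of (22.16)
    sigma_10 = 1.0 / sqrt(n)                              # ML estimator on H0
    sigma_20 = 1.0 / sqrt(n)                              # r on H0
    sigma_11 = (1 - rho1**2) / sqrt(n * (1 + rho1**2))    # ML estimator on H1
    sigma_21 = (1 - rho1**2) / sqrt(n)                    # r on H1
    power_ml = STD_NORMAL.cdf((rho1 - d * sigma_10) / sigma_11)
    power_r = STD_NORMAL.cdf((rho1 - d * sigma_20) / sigma_21)
    return power_ml, power_r

for alpha in (0.01, 0.05):
    power_ml, power_r = asymptotic_powers(alpha)
    print(alpha, power_ml, power_r)

print(STD_NORMAL.cdf(-2.0))  # the size at which the two powers coincide
```

At size 0.01 the test based on $r$ has the slightly larger power, while at size 0.05 the ML estimator does; the differences are small, as the near-equality of the two variances at $\rho = 0.1$ would suggest.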


Testing a simple $H_0$ against a class of alternatives

22.16 So far we have been discussing the most elementary problem, where in
effect we have only to choose between two completely specified competitive hypotheses.
For such a problem, there is a certain symmetry about the situation: it is only a matter
of convention or convenience which of the two hypotheses we regard as being “under
test” and which as “the alternative.” As soon as we proceed to the generalization
of the testing situation, this symmetry disappears.

Consider now the case where $H_0$ is simple, but $H_1$ is composite and consists of a
class of simple alternatives. The most frequently occurring case is the one in which
we have a class $\Omega$ of simple parametric hypotheses of which $H_0$ is one and $H_1$
comprises the remainder; for example, the hypothesis $H_0$ may be that the mean of a certain
distribution has some value $\mu_0$ and the hypothesis $H_1$ that it has some other value
unspecified.

For each of these other values we may apply the foregoing results and find, for
given $\alpha$, corresponding to any particular member of $H_1$ (say $H_t$) a BCR $w_t$. But this
region in general will vary from one $H_t$ to another. We obviously cannot determine a
different region for all the unspecified possibilities and are therefore led to inquire
whether there exists one BCR which is the best for all $H_t$ in $H_1$. Such a region is
called Uniformly Most Powerful (UMP) and the test based on it a UMP test.


22.17 Unfortunately, as we shall find below, a UMP test does not usually exist
unless we restrict our alternative class $\Omega$ in certain ways. Consider, for instance, the
case dealt with in Example 22.2. We found there that for $\mu_1 < \mu_0$ the BCR for a
simple alternative was defined by
$$\bar{x} \le a_\alpha. \tag{22.30}$$
Now so long as $\mu_1 < \mu_0$, the regions determined by (22.30) do not depend on $\mu_1$ and
can be found directly from the sampling distribution of $\bar{x}$ when the test size, $\alpha$, is given.
Consequently the test based on (22.30) is UMP for the class of hypotheses that $\mu_1 < \mu_0$.

However, from Example 22.2, if $\mu_1 > \mu_0$, the BCR is defined by $\bar{x} \ge b_\alpha$. Here
again, if our class $\Omega$ is confined to the values of $\mu_1$ greater than $\mu_0$ the test is UMP.
But if $\mu_1$ can be either greater or less than $\mu_0$, no UMP test is possible, for one or other
of the two UMP regions we have just discussed will be better than any compromise
region against this class of alternatives.
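
The failure of either one-sided region on the wrong side of $\mu_0$ is easily seen numerically. The Python sketch below is an illustration for the unit-variance normal mean of Example 22.3; the function names and the trial values ($\mu_0 = 2$, $n = 25$, $\alpha = 0.05$, alternatives at $2 \pm 0.3$) are assumptions made for the purpose.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_upper(mu1, mu0, n, alpha):
    """Power of the region xbar >= mu0 + d_alpha / sqrt(n) when the mean is mu1."""
    d = -STD_NORMAL.inv_cdf(alpha)
    return STD_NORMAL.cdf(sqrt(n) * (mu1 - mu0) - d)

def power_lower(mu1, mu0, n, alpha):
    """Power of the region xbar <= mu0 - d_alpha / sqrt(n) when the mean is mu1."""
    d = -STD_NORMAL.inv_cdf(alpha)
    return STD_NORMAL.cdf(sqrt(n) * (mu0 - mu1) - d)

mu0, n, alpha = 2.0, 25, 0.05
for mu1 in (1.7, 2.3):
    print(mu1, power_upper(mu1, mu0, n, alpha), power_lower(mu1, mu0, n, alpha))
```

Each region is powerful against alternatives on its own side and has power below $\alpha$ on the other, so neither can serve against the whole two-sided class.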


22.18 We now prove that for a simple $H_0: \theta = \theta_0$ concerning a parameter $\theta$ defining
a class of hypotheses, no UMP test exists in general against an interval including positive
and negative values of $\theta - \theta_0$, under regularity conditions, in particular that the derivative
of the likelihood with respect to $\theta$ is continuous in $\theta$.

We expand the Likelihood Function in a Taylor series about $\theta_0$, getting
$$L(x \mid \theta_1) = L(x \mid \theta_0) + (\theta_1-\theta_0)\,L'(x \mid \theta^*), \tag{22.31}$$
where $\theta^*$ is some value in the interval $(\theta_0, \theta_1)$. For the BCR, if any, we must have,
from (22.6) and (22.31),
$$\frac{L(x \mid \theta_1)}{L(x \mid \theta_0)} = 1 + \frac{(\theta_1-\theta_0)\,L'(x \mid \theta^*)}{L(x \mid \theta_0)} \ge k_\alpha(\theta_1). \tag{22.32}$$
Thus the BCR is defined by
$$\frac{L'(x \mid \theta^*)}{L(x \mid \theta_0)} \ge a_\alpha, \qquad \theta_1 > \theta_0, \tag{22.33}$$
$$\le b_\alpha, \qquad \theta_1 < \theta_0. \tag{22.34}$$

Now consider what happens as $\theta_1$ approaches $\theta_0$. $\theta^*$ necessarily does the same, and
in the immediate neighbourhood of $\theta_0$, (22.33-4) become, in virtue of the continuity
of $L'$ in $\theta$,
$$\frac{L'(x \mid \theta_0)}{L(x \mid \theta_0)} = \left[\frac{\partial\log L}{\partial\theta}\right]_{\theta=\theta_0} \ge a_\alpha, \qquad \theta_1 > \theta_0, \tag{22.35}$$
$$\le b_\alpha, \qquad \theta_1 < \theta_0. \tag{22.36}$$

We thus establish, incidentally, that in the immediate neighbourhood of $\theta_0$, one-sided
tests based on $\left[\dfrac{\partial\log L}{\partial\theta}\right]_{\theta=\theta_0}$ are UMP. This is a testing analogue of the confidence
interval result obtained in 20.17.

Our main result now follows at once. If we are considering an interval of
alternatives including positive and negative values of $(\theta_1-\theta_0)$, (22.35) and (22.36) cannot
both hold (and there can therefore be no BCR) unless
$$\left[\frac{\partial\log L}{\partial\theta}\right]_{\theta=\theta_0} = \text{constant}. \tag{22.37}$$
(22.37) is the essential condition for the existence of a two-sided BCR. It cannot be
satisfied if (17.18) holds (e.g. for distributions with range independent of $\theta$) unless the
constant is zero, for the condition $E\left(\dfrac{\partial\log L}{\partial\theta}\right) = 0$ with (22.37) implies
$\left[\dfrac{\partial\log L}{\partial\theta}\right]_{\theta=\theta_0} = 0$.
In Example 22.6, we have already encountered an instance where a two-sided BCR
exists. The reader should verify that for that distribution $\left[\dfrac{\partial\log L}{\partial\theta}\right]_{\theta=\theta_0} = n$ exactly, so
that (22.37) is satisfied.

UMP tests with more than one parameter 

22.19 If the distribution considered has more than one parameter, and we are
testing a simple hypothesis, it remains possible that a common BCR exists for a class of
alternatives varying with these parameters. The following two examples discuss the
case of the two-parameter normal distribution, where we might expect to find such a
BCR, but where none exists, and the two-parameter exponential distribution, where
a BCR does exist.


Example 22.8 


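Consider the normal distribution with mean $\mu$ and variance $\sigma^2$. The hypothesis
to be tested is
\[ H_0 : \mu = \mu_0, \quad \sigma = \sigma_0, \]
and the alternative, $H_1$, is restricted only in that it must differ from $H_0$. For any
such
\[ H_1 : \mu = \mu_1, \quad \sigma = \sigma_1, \]
the BCR is, from (22.6), given by
\[ \frac{L(x \mid H_0)}{L(x \mid H_1)} = \left(\frac{\sigma_1}{\sigma_0}\right)^{n} \exp\left[-\tfrac{1}{2}\left\{\sum\left(\frac{x-\mu_0}{\sigma_0}\right)^2 - \sum\left(\frac{x-\mu_1}{\sigma_1}\right)^2\right\}\right] \leq k_\alpha. \]
This may be written in the form
\[ s^2\left(\frac{1}{\sigma_0^2}-\frac{1}{\sigma_1^2}\right) + \frac{(\bar{x}-\mu_0)^2}{\sigma_0^2} - \frac{(\bar{x}-\mu_1)^2}{\sigma_1^2} \geq \frac{2}{n}\log\left\{\frac{1}{k_\alpha}\left(\frac{\sigma_1}{\sigma_0}\right)^{n}\right\}, \]
where $\bar{x}$, $s^2$ are sample mean and variance respectively. If $\sigma_0 \neq \sigma_1$, we may further
simplify this to
\[ \frac{\sigma_1^2-\sigma_0^2}{\sigma_0^2\sigma_1^2}\sum (x-\rho)^2 \geq c_\alpha, \tag{22.38} \]
where $c_\alpha$ is independent of the observations, and
\[ \rho = \frac{\mu_0\sigma_1^2 - \mu_1\sigma_0^2}{\sigma_1^2-\sigma_0^2}. \]
We have already dealt with the case $\sigma_0 = \sigma_1$ in Example 22.2, where we took them
both equal to 1.

(22.38), when a strict equality, is the equation of a hypersphere, centred at
$x_1 = x_2 = \ldots = x_n = \rho$. Thus the BCR is always bounded by a hypersphere.
When $\sigma_1 > \sigma_0$, (22.38) yields
\[ \sum (x-\rho)^2 \geq a_\alpha, \]
so that the BCR lies outside the sphere; when $\sigma_1 < \sigma_0$, we find from (22.38)
\[ \sum (x-\rho)^2 \leq b_\alpha, \]
and the BCR is inside the sphere.

Since $\rho$ is a function of $\mu_1$ and $\sigma_1$, it is clear that there will not generally be a common
BCR for different members of $H_1$, even if we limit ourselves by $\sigma_1 < \sigma_0$ and $\mu_1 < \mu_0$
or similar restrictions. We may illustrate the situation by a diagram of the $(\bar{x}, s)$
plane, for
\[ \sum (x-\rho)^2 = \sum (x-\bar{x})^2 + n(\bar{x}-\rho)^2 = n\{s^2 + (\bar{x}-\rho)^2\}, \tag{22.39} \]
and for (22.39) constant, we obtain a circle with centre $(\rho, 0)$ and fixed radius a function
of $\alpha$.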
Fig. 22.2 (adapted from Neyman and Pearson, 1933b) illustrates some of the contours
for particular cases. A single curve, corresponding to a fixed value of $k$ in (22.37),
is shown in each case.

Fig. 22.2. Contours of constant likelihood ratio $k$ (see text)

Cases (1) and (2): $\sigma_1 = \sigma_0$ and $\rho = \pm\infty$. The BCR lies on the right of the line
(1) if $\mu_1 > \mu_0$ and on the left of (2) if $\mu_1 < \mu_0$. This is the case discussed in
Example 22.2.

Case (3): $\sigma_1 < \sigma_0$, say $\sigma_1 = \tfrac{1}{2}\sigma_0$. Then $\rho = \mu_0 + \tfrac{4}{3}(\mu_1-\mu_0)$ and the BCR lies
inside the semicircle marked (3).

Case (4): $\sigma_1 < \sigma_0$ and $\mu_1 = \mu_0$. The BCR is inside the semicircle (4).

Case (5): $\sigma_1 > \sigma_0$ and $\mu_1 = \mu_0$. The BCR is outside the semicircle (5).
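The centres quoted in these cases can be checked numerically from the expression for $\rho$ given with (22.38). The short Python sketch below is illustrative only; the function names `sphere_centre` and `bcr_side` are introduced here for the purpose and are not notation from the text.

```python
def sphere_centre(mu0, sigma0, mu1, sigma1):
    """Centre rho of the hypersphere bounding the BCR in (22.38).

    Only defined when sigma0 != sigma1; for equal variances rho is infinite
    and the boundary degenerates to a plane (cases (1) and (2)).
    """
    if sigma0 == sigma1:
        raise ValueError("rho is infinite when sigma0 == sigma1")
    return (mu0 * sigma1 ** 2 - mu1 * sigma0 ** 2) / (sigma1 ** 2 - sigma0 ** 2)


def bcr_side(sigma0, sigma1):
    """Which side of the sphere the BCR occupies, following (22.38)."""
    return "outside" if sigma1 > sigma0 else "inside"


# Case (3): sigma1 = sigma0 / 2 gives rho = mu0 + (4/3)(mu1 - mu0).
rho_case3 = sphere_centre(mu0=0.0, sigma0=1.0, mu1=1.0, sigma1=0.5)

# Cases (4) and (5): mu1 = mu0 gives rho = mu0 whatever the variances.
rho_case5 = sphere_centre(mu0=2.0, sigma0=1.0, mu1=2.0, sigma1=3.0)
```

Because the centre moves with both $\mu_1$ and $\sigma_1$, two different alternatives generally give spheres with different centres, which is the reason no common BCR exists.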
There is evidently no common BCR for these cases. The regions of acceptance,
however, may have a common part, centred round the value $(\mu_0, \sigma_0)$, and we should
expect them to do so. Let us find the envelope of the BCR, which is, of course, the
same as that of the regions of acceptance. The likelihood ratio is differentiated with
respect to $\mu_1$ and to $\sigma_1$, and these derivatives equated to zero. This gives precisely
the ML solutions (cf. Example 18.11)
\[ \hat{\mu}_1 = \bar{x}, \qquad \hat{\sigma}_1 = s, \]
which, substituted in the likelihood ratio, give for the envelope
\[ k = \left(\frac{s}{\sigma_0}\right)^{n} \exp\left[-\tfrac{1}{2}n\left\{\left(\frac{\bar{x}-\mu_0}{\sigma_0}\right)^2 + \frac{s^2}{\sigma_0^2} - 1\right\}\right] \]
or
\[ \left(\frac{\bar{x}-\mu_0}{\sigma_0}\right)^2 - \log\left(\frac{s^2}{\sigma_0^2}\right) + \frac{s^2}{\sigma_0^2} = 1 - \frac{2}{n}\log k. \tag{22.40} \]

The dotted curve in Fig. 22.2 shows one such envelope. It touches the boundaries
of all the BCR which have the same $k$ (and hence are not of the same size $\alpha$). The
space inside may be regarded as a “good” region of acceptance and the space outside
accordingly as a good critical region. There is no BCR for all alternatives, but the
regions determined by envelopes of likelihood-ratio regions effect a compromise by
picking out and amalgamating parts of critical regions which are best for individual
alternatives.


Example 22.9 
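To test the simple hypothesis
\[ H_0 : \theta = \theta_0, \quad \sigma = \sigma_0 \]
against the alternative
\[ H_1 : \theta = \theta_1 < \theta_0, \quad \sigma = \sigma_1 < \sigma_0, \]
for the distribution
\[ dF = \exp\left\{-\left(\frac{x-\theta}{\sigma}\right)\right\}\frac{dx}{\sigma}, \quad \theta \leq x < \infty; \ \sigma > 0. \tag{22.41} \]
From (22.6), the BCR is given by
\[ \frac{L_0}{L_1} = \left(\frac{\sigma_1}{\sigma_0}\right)^{n} \exp\left\{-\frac{n(\bar{x}-\theta_0)}{\sigma_0} + \frac{n(\bar{x}-\theta_1)}{\sigma_1}\right\} \leq k_\alpha \]
when $x_{(1)} \geq \theta_0$, while $L_0 = 0$ when $x_{(1)} < \theta_0$, so that whatever the values of
$\theta_1$, $\sigma_1$ in $H_1$, the BCR is of form
\[ x_{(1)} < \theta_0 \quad \text{or} \quad \frac{\bar{x}-\theta_0}{\sigma_0} \leq c_\alpha. \tag{22.42} \]
The first of these events has probability zero when $H_0$ holds. There is therefore a
common BCR for the whole class of alternatives $H_1$, on which a UMP test may be based.
We have already effectively dealt with the case $\sigma_1 = \sigma_0$ in Example 22.6.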


UMP tests and sufficient statistics 

22.20 In 22.14 we saw that in testing a simple parametric hypothesis against a
simple alternative, the BCR is necessarily a function of the value of the (jointly) sufficient
statistic for the parameter(s), if one exists. In testing a simple $H_0$ against a composite
$H_1$ consisting of a class of simple parametric alternatives, it evidently follows from
the argument of 22.14 that if a common BCR exists, providing a UMP test against $H_1$,
and if $t$ is a sufficient statistic for the parameter(s), then the BCR will be a function of $t$.
But, since a UMP test does not always exist, new questions now arise. Does the
existence of a UMP test imply the existence of a corresponding sufficient statistic?
And, conversely, does the existence of a sufficient statistic guarantee the existence of a
corresponding UMP test?


22.21 The first of these questions may be affirmatively answered if an additional
condition is imposed. In fact, as Neyman and Pearson (1936a) showed, if (1) there is a
common BCR for, and therefore a UMP test of, $H_0$ against $H_1$ for every size $\alpha$ in an
interval $0 < \alpha \leq \alpha_0$ (where $\alpha_0$ is not necessarily equal to 1); and (2) if every point
in the sample space $W$ (save possibly a set of measure zero) forms part of the boundary
of the BCR for at least one value of $\alpha$, and then corresponds to a value of $L(x \mid H_0) > 0$;
then a single sufficient statistic exists for the parameter(s) whose variation provides the
class of admissible alternatives $H_1$.

To establish this result, we first note that, if a common BCR exists for $H_0$ against $H_1$
for two test sizes $\alpha_1$ and $\alpha_2 < \alpha_1$, a common BCR of size $\alpha_2$ can always be formed as
a sub-region of that of size $\alpha_1$. This follows from the fact that any common BCR
satisfies (22.6). We may therefore, without loss of generality, take it that as $\alpha$ decreases,
the BCR is adjusted simply by exclusion of some of its points.(*)

Now, suppose that conditions (1) and (2) are satisfied. If a point (say, $x$) of $W$
forms part of the boundary of the BCR for only one value of $\alpha$, we define the statistic
\[ t(x) = \alpha. \tag{22.43} \]
If a point $x$ forms part of the BCR boundary for more than one value of $\alpha$, we define
\[ t(x) = \tfrac{1}{2}(\alpha_1 + \alpha_2), \tag{22.44} \]
where $\alpha_1$ and $\alpha_2$ are the smallest and largest values of $\alpha$ for which it does so: it follows
from the remark of the last paragraph that $x$ will also be part of the BCR boundary
for all $\alpha$ in the interval $(\alpha_1, \alpha_2)$. The statistic $t$ is thus defined by (22.43) and (22.44)
for all points in $W$ (except possibly a zero-measure set). Further, if $t$ has the same
value at two points, they must lie on the same boundary. Thus, from (22.6), we have
\[ \frac{L(x \mid \theta_0)}{L(x \mid \theta_1)} = k(t, \theta_0, \theta_1), \]
where $k$ does not contain the observations except in the statistic $t$. Thus we must
have
\[ L(x \mid \theta) = g(t \mid \theta)\, h(x) \tag{22.45} \]
so that the single statistic $t$ is sufficient for $\theta$, the set of parameters concerned.

(*) This is not true of critical regions in general: see, e.g., Chernoff (1951).


22.22 We have already considered in Example 22.2 a situation where single sufficiency
and a UMP test exist together. Exercises 22.1 to 22.3 give further instances.
But condition (2) of 22.21 is not always fulfilled, and then the existence of a single
sufficient statistic may not follow from that of a UMP test. The following example
illustrates the point.


Example 22.10 


In Example 22.9, we showed that the distribution (22.41) admits a UMP test of
the $H_0$ against the $H_1$ there described. The UMP test is based on the BCR (22.42),
depending on $x_{(1)}$ and $\bar{x}$.

We have already seen (cf. 17.36, Example 17.19 and Exercise 17.9) that the smallest
observation $x_{(1)}$ is sufficient for $\theta$ if $\sigma$ is known, and that $\bar{x}$ is sufficient for $\sigma$ if $\theta$ is known.
The pair of statistics $x_{(1)}$ and $\bar{x}$ are jointly sufficient, but there is no single sufficient
statistic for $\theta$ and $\sigma$.


22.23 On the other hand, the possibility that a single sufficient statistic exists 
without a one-sided UMP test, even where only a single parameter is involved, is made 
clear by Example 22.11. 


Example 22.11 
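Consider the multinormal distribution of $n$ variates $x_1, \ldots, x_n$, with
\[ E(x_1) = n\theta, \quad \theta > 0, \]
\[ E(x_r) = 0, \quad r > 1; \]
and dispersion matrix
\[ V = \begin{pmatrix} n-1+\theta^2 & -1 & \cdots & -1 \\ -1 & 1 & & 0 \\ \vdots & & \ddots & \\ -1 & 0 & & 1 \end{pmatrix}. \tag{22.46} \]
The determinant of this matrix is easily seen to be
\[ |V| = \theta^2 \]
and its inverse matrix is
\[ V^{-1} = \frac{1}{\theta^2}\begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1+\theta^2 & \cdots & 1 \\ \vdots & & \ddots & \vdots \\ 1 & 1 & \cdots & 1+\theta^2 \end{pmatrix}. \tag{22.47} \]
Thus, from 15.3, the joint distribution is
\[ dF = \theta^{-1}(2\pi)^{-n/2} \exp\left\{-\tfrac{1}{2}\left[\frac{n^2}{\theta^2}(\bar{x}-\theta)^2 + \sum_{i=2}^{n} x_i^2\right]\right\} dx_1 \ldots dx_n. \tag{22.48} \]

Consider now the testing of the hypothesis $H_0 : \theta = \theta_0 > 0$ against $H_1 : \theta = \theta_1 > 0$
on the basis of a single observation. From (22.6), the BCR is given by
\[ \frac{L(x \mid \theta_0)}{L(x \mid \theta_1)} = \frac{\theta_1}{\theta_0} \exp\left\{-\frac{n^2}{2}\left[\frac{(\bar{x}-\theta_0)^2}{\theta_0^2} - \frac{(\bar{x}-\theta_1)^2}{\theta_1^2}\right]\right\} \leq k_\alpha, \]
which reduces to
\[ \frac{(\bar{x}-\theta_1)^2}{\theta_1^2} - \frac{(\bar{x}-\theta_0)^2}{\theta_0^2} \leq \frac{2}{n^2}\log(k_\alpha\theta_0/\theta_1) \]
or
\[ \bar{x}^2(\theta_0^2-\theta_1^2) - 2\bar{x}\,\theta_0\theta_1(\theta_0-\theta_1) \leq \frac{2\theta_0^2\theta_1^2}{n^2}\log(k_\alpha\theta_0/\theta_1). \]
If $\theta_0 > \theta_1$, this is of form
\[ \bar{x}^2(\theta_0+\theta_1) - 2\bar{x}\,\theta_0\theta_1 \leq a_\alpha, \tag{22.49} \]
which implies
\[ b_\alpha \leq \bar{x} \leq c_\alpha. \tag{22.50} \]
If $\theta_0 < \theta_1$, the BCR is of form
\[ \bar{x}^2(\theta_0+\theta_1) - 2\bar{x}\,\theta_0\theta_1 \geq d_\alpha, \tag{22.51} \]
implying
\[ \bar{x} \leq e_\alpha \quad \text{or} \quad \bar{x} \geq f_\alpha. \tag{22.52} \]

In both (22.50) and (22.52), the limits between which (or outside which) $\bar{x}$ has to lie
are functions of the exact value of $\theta_1$. This difficulty, which arises from the fact that
$\theta_1$ appears in the coefficient of $\bar{x}^2$ in the quadratics (22.49) and (22.51), means that
there is no BCR even for a one-sided set of alternatives, and therefore no UMP test.

It is easily verified that $\bar{x}$ is a single sufficient statistic for $\theta$, and this completes
the demonstration that single sufficiency does not imply the existence of a UMP test.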
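The matrix statements in this example are easy to confirm numerically. The sketch below, in plain Python, multiplies the dispersion matrix (22.46) by the inverse stated at (22.47) and compares the exponent's quadratic form with its reduced version in (22.48). The function names are hypothetical helpers introduced for this check, and the values of $n$ and $\theta$ used are arbitrary.

```python
def dispersion(n, theta):
    """Dispersion matrix V of (22.46), as a list of rows."""
    v = [[0.0] * n for _ in range(n)]
    v[0][0] = n - 1 + theta ** 2
    for i in range(1, n):
        v[0][i] = v[i][0] = -1.0
        v[i][i] = 1.0
    return v


def stated_inverse(n, theta):
    """Inverse matrix as stated at (22.47)."""
    t2 = theta ** 2
    w = [[1.0 / t2] * n for _ in range(n)]
    for i in range(1, n):
        w[i][i] = (1.0 + t2) / t2
    return w


def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def quadratic_forms(x, theta):
    """Exponent's quadratic form computed from V^{-1} and from (22.48)."""
    n = len(x)
    w = stated_inverse(n, theta)
    # Deviations from the mean vector (n*theta, 0, ..., 0).
    d = [x[0] - n * theta] + list(x[1:])
    direct = sum(d[i] * w[i][j] * d[j] for i in range(n) for j in range(n))
    xbar = sum(x) / n
    reduced = (n / theta) ** 2 * (xbar - theta) ** 2 + sum(xi ** 2 for xi in x[1:])
    return direct, reduced


product = matmul(dispersion(4, 1.7), stated_inverse(4, 1.7))  # should be the identity
direct, reduced = quadratic_forms([2.0, -0.5, 1.25, 0.3], 1.7)
```

The two quadratic forms agree because the first row and column of $V^{-1}$ collect the deviations into $n(\bar{x}-\theta)$, which is why $\bar{x}$ alone carries the dependence on $\theta$.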


The power function 

22.24 Now that we are considering the testing of a simple $H_0$ against a composite
$H_1$, we generalize the idea of the power of a test defined at (22.2). As we stated there,
the power is an explicit function of $H_1$. If, as is usual, $H_1$ is formed by the variation
of a set of parameters $\theta$, the power of a test of $H_0 : \theta = \theta_0$ against the simple alternative
$H_1 : \theta = \theta_1 > \theta_0$ will be a function of the value of $\theta_1$. For instance, we saw in
Example 22.3 that the power of the sample mean $\bar{x}$ as a test of the hypothesis that the
mean $\mu$ of a normal population is $\mu_0$, against the alternative value $\mu_1 > \mu_0$, is given
by (22.18), a monotone increasing function of $\mu_1$. (22.18) is called the power function
of $\bar{x}$ as a test of $H_0$ against the class of alternatives $H_1 : \mu > \mu_0$. We indicate the
compositeness of $H_1$ by writing it thus, instead of the form used for a simple
$H_1 : \mu = \mu_1 > \mu_0$.

The evaluation of a power function is rarely so easy as in Example 22.3, since even
if the sampling distribution of the test statistic is known exactly for both $H_0$ and the
class of alternatives $H_1$ (and more commonly only approximations are available, especially
for $H_1$), there is still the problem of evaluating (22.2) for each value of $\theta$ in $H_1$,
which usually is a matter of numerical mathematics: only rarely is the power function
exactly obtainable from a tabulated integral, as at (22.18). Asymptotically, however,
the Central Limit theorem comes to our aid: the distributions of many test statistics
tend to normality, given either $H_0$ or $H_1$, as sample size increases, and then the
asymptotic power function will be of the form (22.18), as we shall see later.


Example 22.12 
The general shape of the power function (22.18) in Example 22.3 is simply that of
the normal distribution function. It increases from the value
\[ G\{-d_\alpha\} = \alpha \]
at $\mu = \mu_0$ (in accordance with the size requirement) to the value
\[ G\{0\} = 0.5 \]
at $\mu = \mu_0 + d_\alpha/n^{1/2}$, the first derivative $G'$ increasing up to this point; as $\mu$ increases
beyond it, $G'$ declines to its limiting value of zero as $G$ increases to its asymptote 1.


22.25 Once the power function of a test has been determined, it is of obvious
value in determining how large the sample should be in order to test $H_0$ with given
size and power. The procedure is illustrated in the next example.


Example 22.13 
How many observations should be taken in Example 22.3 so that we may test
$H_0 : \mu = 3$ with $\alpha = 0.05$ (i.e. $d_\alpha = 1.6449$) and power of at least 0.75 against the
alternatives that $\mu \geq 3.5$? Put otherwise, how large should $n$ be to ensure that the
probability of a Type I error is 0.05, and that of a Type II error at most 0.25 for
$\mu \geq 3.5$?

From (22.18), we require $n$ large enough to make
\[ G\{n^{1/2}(3.5-3) - 1.6449\} = 0.75, \tag{22.53} \]
it being obvious that the power will be greater than this for $\mu > 3.5$. Now, from a
table of the normal distribution
\[ G\{0.6745\} = 0.75, \tag{22.54} \]
and hence, from (22.53) and (22.54),
\[ 0.5\,n^{1/2} - 1.6449 = 0.6745, \]
whence
\[ n = (4.6388)^2 = 21.5 \text{ approx.}, \]
so that $n = 22$ will suffice to give the test the required power property.
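The arithmetic of this example can be reproduced with the Python standard library. The sketch below assumes, as in Example 22.3, a normal population with unit variance; the function name `required_n` and its `sigma` argument are illustrative additions, not part of the text.

```python
from math import ceil
from statistics import NormalDist


def required_n(mu0, mu1, alpha, power, sigma=1.0):
    """Solve G{n^(1/2) (mu1 - mu0)/sigma - d_alpha} = power for n, as in (22.53).

    sigma = 1 corresponds to the unit-variance setting of Example 22.3;
    other values are a hypothetical generalisation.
    """
    z = NormalDist()
    d_alpha = z.inv_cdf(1.0 - alpha)   # 1.6449 for alpha = 0.05
    d_power = z.inv_cdf(power)         # 0.6745 for power = 0.75
    root_n = (d_alpha + d_power) * sigma / (mu1 - mu0)
    return root_n ** 2


n_exact = required_n(mu0=3.0, mu1=3.5, alpha=0.05, power=0.75)  # about 21.5
n_needed = ceil(n_exact)                                         # 22
```

Rounding up is needed because the power is increasing in $n$, so any integer at or above the exact solution satisfies the requirement.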


One- and two-sided tests 
22.26 We have seen in 22.18 that in general no UMP test exists when we test a
parametric hypothesis $H_0 : \theta = \theta_0$ against a two-sided alternative hypothesis, i.e. one
in which $\theta - \theta_0$ changes sign. Nevertheless, situations often occur in which such an
alternative hypothesis is appropriate; in particular, when we have no prior knowledge
about the values of $\theta$ likely to occur. In such circumstances, it is tempting to continue
to use as our test statistic one which is known to give a UMP test against one-sided
alternatives ($\theta > \theta_0$ or $\theta < \theta_0$) but to modify the critical region in the distribution of
the statistic by compromising between the BCR for $\theta > \theta_0$ and the BCR for $\theta < \theta_0$.


22.27 For instance, in Example 22.2 and in 22.17 we saw that the mean $\bar{x}$, used to
test $H_0 : \mu = \mu_0$ for the mean $\mu$ of a normal population, gives a UMP test against
$\mu_1 < \mu_0$ with common BCR $\bar{x} \leq a_\alpha$, and a UMP test for $\mu_1 > \mu_0$ with common BCR
$\bar{x} \geq b_\alpha$. Suppose, then, that for the alternative $H_1 : \mu \neq \mu_0$, which is two-sided, we
construct a compromise critical region defined by
\[ \bar{x} \leq a_{\alpha/2}, \qquad \bar{x} \geq b_{\alpha/2}, \tag{22.55} \]
in other words, combining the one-sided critical regions and making each of them of
size $\tfrac{1}{2}\alpha$, so that the critical region as a whole remains one of size $\alpha$.

We know that the critical region defined by (22.55) will always be less powerful
than one or other of the one-sided BCR, but we also know that it will always be more
powerful than the other. For its power will be, exactly as in Example 22.3,
\[ G\{n^{1/2}(\mu-\mu_0) - d_{\alpha/2}\} + G\{n^{1/2}(\mu_0-\mu) - d_{\alpha/2}\}. \tag{22.56} \]
(22.56) is an even function of $(\mu-\mu_0)$, with a minimum at $\mu = \mu_0$. Hence it is always
intermediate in value between $G\{n^{1/2}(\mu-\mu_0) - d_\alpha\}$ and $G\{n^{1/2}(\mu_0-\mu) - d_\alpha\}$, which are
the power functions of the one-sided BCR, except when $\mu = \mu_0$, when all three expressions
are equal. The comparison is worth making diagrammatically, in Fig. 22.3,
where a single fixed value of $n$ and of $\alpha$ is illustrated.
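The comparison of the three power functions can also be made numerically. The following Python sketch evaluates (22.56) and the two one-sided power functions for a unit-variance normal population; the function names and the particular values $n = 20$, $\alpha = 0.05$ are illustrative choices, not taken from the figure.

```python
from math import sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution function G


def power_upper(delta, n, alpha):
    """Power of the one-sided BCR in the upper tail at mu - mu0 = delta."""
    return _Z.cdf(sqrt(n) * delta - _Z.inv_cdf(1.0 - alpha))


def power_lower(delta, n, alpha):
    """Power of the one-sided BCR in the lower tail at mu - mu0 = delta."""
    return _Z.cdf(-sqrt(n) * delta - _Z.inv_cdf(1.0 - alpha))


def power_two_sided(delta, n, alpha):
    """Power (22.56) of the equal-tails region (22.55)."""
    d_half = _Z.inv_cdf(1.0 - alpha / 2.0)
    return _Z.cdf(sqrt(n) * delta - d_half) + _Z.cdf(-sqrt(n) * delta - d_half)


# Tabulate the three curves over a small grid of departures from mu0.
grid = [-0.8, -0.3, -0.1, 0.0, 0.1, 0.3, 0.8]
table = [(d, power_lower(d, 20, 0.05), power_two_sided(d, 20, 0.05),
          power_upper(d, 20, 0.05)) for d in grid]
```

On each row of `table` the middle value lies between the two outer ones, and all three coincide with $\alpha$ at zero departure, in line with the argument above.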


22.28 We shall see later that other, less intuitive, justifications can be given for
splitting the critical region in this way between the tails of the distribution of the test
statistic. For the moment, the procedure is to be regarded as simply a common-sense
way of insuring against the risk of extreme loss of power which, as Fig. 22.3 makes
clear, would be the result if we located the critical region in the tail of the statistic's
distribution which turned out to be the wrong one for the true value of $\theta$.

Fig. 22.3. Power functions of three tests based on $\bar{x}$: critical region in both tails equally;
critical region in upper tail; critical region in lower tail.


Choice of test size 

22.29 Throughout our exposition so far we have assumed that the test size $\alpha$ has
been fixed in some way, and all our results are valid however that choice was made.
We now turn to the question of how $\alpha$ is to be determined.

In the first place, it is natural to suggest that $\alpha$ should be made “small” according
to some acceptable criterion, and indeed it is customary to use certain conventional
values of $\alpha$, such as 0.05, 0.01 or 0.001. But we must take care not to go too far in
this direction. We can only fix two of the quantities $n$, $\alpha$ and $\beta$, even in testing a
simple $H_0$ against a simple $H_1$. If $n$ is fixed, we can only in general decrease the value
of $\alpha$, the probability of Type I error, by increasing the value of $\beta$, the probability of
Type II error. In other words, reduction in the size of a test decreases its power.

This point is well illustrated in Example 22.3 by the expression (22.18) for the
power of the BCR in a one-sided test for a normal population mean. We see there
that as $\alpha \to 0$, by (22.16) $d_\alpha \to \infty$, and consequently the power (22.18) $\to 0$.

Thus, for fixed sample size, we have essentially to reconcile the size and power of
the test. If the practical risks attaching to a Type I error are great, while those attaching
to a Type II error are small, there is a case for reducing $\alpha$, at the expense of
increasing $\beta$, if $n$ is fixed. If, however, sample size is at our disposal, we may, as in
Example 22.13, ensure that $n$ is large enough to reduce both $\alpha$ and $\beta$ to any pre-assigned
levels. These levels have still to be fixed, but unless we have supplementary information
in the form of the costs (in money or other common terms) of the two types of
error, and the costs of making observations, we cannot obtain an “optimum” combination
of $\alpha$, $\beta$ and $n$ for any given problem. It is sufficient for us to note that, however
$\alpha$ is determined, we shall obtain a valid test.


22.30 The point discussed in 22.29 is reflected in another, which has sometimes
been made the basis of criticism of the theory of testing hypotheses.

Suppose that we carry out a test with $\alpha$ fixed, no matter how, and $n$ extremely large.
The power of a reasonable test will be very near 1, in detecting departure of any sort
from the hypothesis tested. Now, the argument (formulated by Berkson (1938))
runs: “Nobody really supposes that any hypothesis holds precisely: we are simply
setting up an abstract model of real events which is bound to be some way, if only a
little, from the truth. Nevertheless, as we have seen, an enormous sample would
almost certainly (i.e. with probability approaching 1 as $n$ increases beyond any bound)
reject the hypothesis tested at any pre-assigned size $\alpha$. Why, then, do we bother to
test the hypothesis at all with a smaller sample, whose verdict is less reliable than the
larger one's?”

This paradox is really concerned with two points. In the first place, if $n$ is fixed,
and we are not concerned with the exactness of the hypothesis tested, but only with
its approximate validity, our alternative hypothesis would embody this fact by being
sufficiently distant from the hypothesis tested to make the difference of practical interest.
This in itself would tend to increase the power of the test. But if we had no wish to
reject the hypothesis tested on the evidence of small deviations from it, we should
want the power of the test to be very low against these small deviations, and this would
imply a small $\alpha$ and a correspondingly high $\beta$ and low power.

But the crux of the paradox is the argument from increasing sample size. The
hypothesis tested will only be rejected with probability near 1 if we keep $\alpha$ fixed as $n$
increases. There is no reason why we should do this: we can determine $\alpha$ in any
way we please, and it is rational, in the light of the discussion of 22.29, to apply the
gain in sensitivity arising from increased sample size to the reduction of $\alpha$ as well as
of $\beta$. It is only the habit of fixing $\alpha$ at certain conventional levels which leads to the
paradox. If we allow $\alpha$ to decline as $n$ increases, it is no longer certain that a very
small departure from $H_0$ will cause $H_0$ to be rejected: this now depends on the rate
at which $\alpha$ declines.
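A small numerical sketch may make the dependence on the rate of decline concrete. It uses the two-sided power function (22.56) for a unit-variance normal mean, and compares a fixed $\alpha = 0.05$ with one purely hypothetical rule for letting $\alpha$ decline, namely rejecting only when $|\bar{x}-\mu_0|$ exceeds a fixed margin $c$, so that the critical value is $c\,n^{1/2}$. Both the rule and the function names are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution function G


def two_sided_power(delta, n, d):
    """Power (22.56) at departure delta when the critical value is d = d_{alpha/2}."""
    return _Z.cdf(sqrt(n) * delta - d) + _Z.cdf(-sqrt(n) * delta - d)


def power_fixed_alpha(delta, n, alpha=0.05):
    """Power when alpha is held at a conventional level whatever n."""
    return two_sided_power(delta, n, _Z.inv_cdf(1.0 - alpha / 2.0))


def power_declining_alpha(delta, n, c=0.02):
    """Power under the hypothetical rule 'reject when |xbar - mu0| > c'."""
    return two_sided_power(delta, n, c * sqrt(n))


def size_declining_alpha(n, c=0.02):
    """Size implied by the hypothetical rule; it shrinks as n grows."""
    return 2.0 * _Z.cdf(-c * sqrt(n))


small_departure = 0.01
fixed = [power_fixed_alpha(small_departure, n) for n in (10**2, 10**4, 10**6)]
declining = [power_declining_alpha(small_departure, n) for n in (10**2, 10**4, 10**6)]
```

With $\alpha$ fixed, the power against the departure 0.01 rises towards 1 as $n$ grows; under the declining rule it falls towards 0, while departures larger than the margin $c$ are still detected with power approaching 1.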


22.31 There is a converse to the paradox discussed in 22.30. Just as, for large $n$,
inflexible use of conventional values of $\alpha$ will lead to very high power, which may
possibly be too high for the problem in hand, so for very small fixed $n$ their use will
lead to very low power, perhaps too low. Again, the situation can be remedied by
allowing $\alpha$ to rise and consequently reducing $\beta$. It is always incumbent upon the
statistician to satisfy himself that, for the conditions of his problem, he is not sacrificing
sensitivity in one direction to sensitivity in another.


Example 22.14 


E. S. Pearson (discussion on Lindley (1953a)) has calculated a few values of the
power function (22.56) of the two-sided test for a normal mean, which we reproduce
to illustrate our discussion.

Table 22.1. Power function calculated from (22.56)

The entries in the first row of the table give the sizes of the tests.

Columns: sample size ($n$). Rows: value of $|\mu-\mu_0|$.

It will be seen from the table that when sample size is increased from 20 to 100,
the reductions of $\alpha$ from 0.050 to 0.019 and 0.0056 successively reduce the power of the
test for each value of $|\mu-\mu_0|$. In fact, for $\alpha = 0.0056$ and $|\mu-\mu_0| = 0.1$, the
power actually falls below the value attained at $n = 20$ with $\alpha = 0.05$. Conversely,
on reduction of sample size from 20 to 10, the increase in $\alpha$ to 0.072 and 0.111 increases
the power correspondingly, though only in the case $\alpha = 0.111$, $|\mu-\mu_0| = 0.2$, does
it exceed the power at $n = 20$, $\alpha = 0.05$.


EXERCISES 


22.1 Show directly by use of (22.6) that the BCR for testing a simple hypothesis
$H_0 : \mu = \mu_0$ concerning the mean $\mu$ of a Poisson distribution against a simple alternative
$H_1 : \mu = \mu_1$, is of the form
\[ \bar{x} \leq a_\alpha \ \text{ if } \mu_0 > \mu_1; \qquad \bar{x} \geq b_\alpha \ \text{ if } \mu_0 < \mu_1, \]
where $\bar{x}$ is the sample mean and $a_\alpha$, $b_\alpha$ are constants.


22.2 Show similarly that for the parameter $\varpi$ of a binomial distribution, the BCR
is of the form
\[ x \leq a_\alpha \ \text{ if } \varpi_0 > \varpi_1; \qquad x \geq b_\alpha \ \text{ if } \varpi_0 < \varpi_1, \]
where $x$ is the observed number of “successes” in the sample.


22.3 Show that for the normal distribution with zero mean and variance $\sigma^2$, the
BCR for $H_0 : \sigma = \sigma_0$ against the alternative $H_1 : \sigma = \sigma_1$ is of form
\[ \sum x^2 \leq a_\alpha \ \text{ if } \sigma_0 > \sigma_1; \qquad \sum x^2 \geq b_\alpha \ \text{ if } \sigma_0 < \sigma_1. \]
Show that the power of the BCR when $\sigma_0 > \sigma_1$ is $F\left\{\dfrac{\sigma_0^2}{\sigma_1^2}\chi^2_\alpha\right\}$, where $\chi^2_\alpha$ is the
lower $100\alpha$ per cent point and $F$ is the d.f. of the $\chi^2$ distribution with $n$ degrees of freedom.


22.4 In Example 22.4, show that the power of the test is 0.648 when $k_\alpha = 1$ and
0.352 when $k_\alpha = 0.5$. Draw a diagram of the two Cauchy distributions to illustrate
the power and size of each test.


22.5 In Exercise 22.3, verify that the power is a monotone increasing function of
$\sigma_0^2/\sigma_1^2$, and also verify numerically from a table of the $\chi^2$ distribution that the power is a
monotone increasing function of $n$.


22.6 Confirm that (22.21) holds for the sufficient statistics on which the BCR of
Example 22.2, and Exercises 22.1 to 22.3 are based.


22.7 In 22.15 show that the more efficient estimator always gives the more powerful
test if its test power exceeds 0.5.
(Sundrum, 1954)


22.8 Show that for testing $H_0 : \mu = \mu_0$ in samples from the distribution
\[ dF = dx, \quad \mu \leq x \leq \mu + 1, \]
there is a pair of UMP one-sided tests, and hence no UMP test for all alternatives.


22.9 In Example 22.11, show that $\bar{x}$ is normally distributed with mean $\theta$ and variance
$\theta^2/n^2$, and that it is a sufficient statistic for $\theta$.


22.10 Verify that the distribution of Example 22.10 does not satisfy condition (2)
of 22.21.


22.11 In Example 22.9, let $\sigma$ be any positive increasing function of $\theta$. Show that
to test $H_0 : \theta = \theta_0$ against $H_1 : \theta = \theta_1 < \theta_0$ there is still a common BCR.
(Neyman & Pearson, 1936a)


22.12 Generalizing the discussion of 22.27, write down the power function of any
test based on the distribution of $\bar{x}$ with its critical region of form
\[ \bar{x} \leq a_{\alpha_1}, \qquad \bar{x} \geq b_{\alpha_2}, \]
where $\alpha_1 + \alpha_2 = \alpha$ ($\alpha_1$ and $\alpha_2$ not necessarily being equal). Show that the power function
of any such test lies completely between those for the cases $\alpha_1 = 0$, $\alpha_2 = 0$ illustrated
in Fig. 22.3.


22.13 Referring to the discussion of 22.14, show that the likelihood ratio (for testing
a simple $H_0 : \theta = \theta_0$ against a simple $H_1 : \theta = \theta_1$) is a sufficient statistic for $\theta$ on either
hypothesis by writing the Likelihood Function as
\[ L(x \mid \theta) = L(x \mid \theta_0)\left[\frac{L(x \mid \theta_1)}{L(x \mid \theta_0)}\right]^{(\theta-\theta_0)/(\theta_1-\theta_0)}. \]
(Pitman, 1957)


CHAPTER 23 
TESTS OF HYPOTHESES: COMPOSITE HYPOTHESES 


23.1 We have seen in Chapter 22 that, when the hypothesis tested is simple (speci- 
fying the distribution completely), there is always a BCR, providing a most powerful 
test, against a simple alternative hypothesis ; that there may be a UMP test against a 
class of simple hypotheses constituting a composite parametric alternative hypothesis ; 
and that there will not, in general, be a UMP test if the parameter whose variation 
generates the alternative hypothesis is free to vary in both directions from the value 
tested. 

If the hypothesis tested is composite, leaving at least one parameter value unspecified, 
it is to be expected that UMP tests will be even rarer than for simple hypotheses, but 
we shall find that progress can be made if we are prepared to restrict the class of tests 
considered in certain ways.


Composite hypotheses 
23.2 First, we formally define the problem. We suppose that the $n$ observations
have a distribution dependent upon the values of $l\,(\leq n)$ parameters which we shall
write
\[ L(x \mid \theta_1, \ldots, \theta_l) \]
as before. The hypothesis to be tested is
\[ H_0 : \theta_1 = \theta_{10};\ \theta_2 = \theta_{20};\ \ldots;\ \theta_k = \theta_{k0}, \tag{23.1} \]
where $k \leq l$, and the second suffix 0 denotes the value specified by the hypothesis.
We lose no generality by thus labelling the $k$ parameters whose values are specified by
$H_0$ as the first $k$ of the set of $l$ parameters. $H_0$ as defined at (23.1) is said to impose $k$
constraints, or alternatively to have $l-k$ degrees of freedom, though this latter (older)
usage requires some care since we already use the term “degrees of freedom” in
another sense.

Hypotheses of the form
\[ H_0 : \theta_1 = \theta_2;\ \theta_3 = \theta_4;\ \text{etc.}, \]
which do not specify the values of parameters whose equality we are testing, may be
transformable into the form (23.1) by reparametrizing the problem in terms of $\theta_1-\theta_2$,
$\theta_3-\theta_4$, etc., and testing the hypothesis that these new parameters have zero values.
Thus (23.1) is a more general composite hypothesis than at first appears.

To keep our notation simple, we shall write $L(x \mid \theta_r, \theta_s)$ and
\[ H_0 : \theta_r = \theta_{r0}, \tag{23.2} \]
where it is to be understood that $\theta_r$, $\theta_s$ may each consist of more than one parameter,
the “nuisance parameter” $\theta_s$ being left unspecified by the hypothesis tested.
An optimum property of sufficient statistics


23.3 This is a convenient place to prove an optimum test property of sufficient
statistics analogous to the result proved in 17.35. There we saw that if $t_1$ is an unbiassed
estimator of $\theta$ and $t$ is a sufficient statistic for $\theta$, then the statistic $E(t_1 \mid t)$ is
unbiassed for $\theta$ with variance no greater than that of $t_1$. We now prove a result due to
Lehmann (1950): if $w$ is a critical region for testing $H_0$, a hypothesis concerning $\theta$
in $L(x \mid \theta)$, against some alternative $H_1$, and $t$ is a sufficient statistic, both on $H_0$ and
on $H_1$, for $\theta$, then there is a test of the same size, based on a function of $t$, which has
the same power as $w$.
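A small exact illustration of this result, not taken from the text, may help before the proof. Take $n$ independent Bernoulli trials with success probability $p$, for which $t = \sum x$ is sufficient, and the deliberately wasteful critical region $w = \{x_1 = 1\}$. Then $E\{c(w) \mid t\} = t/n$, and the randomised test that rejects with probability $t/n$ should have the same rejection probability as $w$ for every $p$. The Python sketch below checks this; the function names are hypothetical.

```python
from math import comb


def power_original(p):
    """Rejection probability of the region w = {x_1 = 1} under Bernoulli(p)."""
    return p


def power_via_sufficient(p, n):
    """Rejection probability of the test that rejects with probability t/n.

    t = sum of the n Bernoulli observations has a Binomial(n, p) distribution,
    and t/n is the conditional probability E{c(w) | t} of falling in w.
    """
    return sum((t / n) * comb(n, t) * p ** t * (1.0 - p) ** (n - t)
               for t in range(n + 1))


# The two agree for any p, so size (p under H0) and power (p under H1) match.
checks = [(p, power_original(p), power_via_sufficient(p, 7)) for p in (0.1, 0.5, 0.9)]
```

The test based on $t$ is randomised here because $E\{c(w) \mid t\}$ takes values strictly between 0 and 1; the result guarantees equal size and power, not that the new test is a non-randomised region.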

We first define a function(*)
\[ c(w) = \begin{cases} 1 & \text{if the sample point is in } w, \\ 0 & \text{otherwise.} \end{cases} \tag{23.3} \]
Then the integral
\[ \int c(w)\, L(x \mid \theta)\, dx = E\{c(w)\} \tag{23.4} \]
gives the probability that the sample point falls into $w$, and is therefore equal to the
size ($\alpha$) of the test when $H_0$ is true and to the power of the test when $H_1$ is true. Using
the factorization property (17.68) of the Likelihood Function in the presence of a sufficient
statistic, (23.4) becomes
\[ E\{c(w)\} = \int c(w)\, h(x \mid t)\, g(t \mid \theta)\, dx = E\{E(c(w) \mid t)\}, \tag{23.5} \]
the expectation operation outside the braces being with respect to the distribution of $t$.
Thus the particular function of $t$, $E(c(w) \mid t)$, not dependent upon $\theta$ since $t$ is sufficient,
has the same expectation as $c(w)$. There is therefore a test based on the sufficient
statistic $t$ which has the same size and power as the original region $w$. We may therefore
without loss of power confine the discussion of any test problem to functions of
a sufficient statistic.

This result is quite general, and therefore also covers the case of a simple $H_0$ discussed
in Chapter 22.

(*) $c(w)$ is known in measure theory as the characteristic function of the set of points $w$. We
shall avoid this terminology, since there is some possibility of confusion with the use of
“characteristic function” for the Fourier transform of a distribution function, with which we have
been familiar since Chapter 4.


Test size for composite hypotheses: similar regions

23.4 Since a composite hypothesis leaves some parameter values unspecified, a
new problem immediately arises, for the size of the test of $H_0$ will obviously be a function,
in general, of these unspecified parameter values, $\theta_s$.

If we wish to keep Type I errors down to some preassigned level, we must seek
critical regions whose size can be kept down to that level for all possible values of $\theta_s$.
Thus we require
\[ \alpha(\theta_s) \leq \alpha. \tag{23.6} \]
If a critical region has
\[ \alpha(\theta_s) = \alpha \tag{23.7} \]
as a strict equality for all $\theta_s$, it is called a (critical) region similar to the sample space(*)
with respect to $\theta_s$, or, more briefly, a similar (critical) region. The test based on a
similar critical region is called a similar size-$\alpha$ test.


23.5 It is not obvious that similar regions exist at all generally, but, in one sense,
as Feller (1938) pointed out, they exist whenever we are dealing with a set of $n$ independent
identically distributed observations on a continuous variate $x$. For no matter
what the distribution or its parameters, we have
\[ P\{x_1 < x_2 < x_3 < \ldots < x_n\} = 1/n! \tag{23.8} \]
(cf. 11.4), since any of the $n!$ permutations of the $x_i$ is equally likely. Thus, for $\alpha$
an integral multiple of $1/n!$, there are similar regions based on the $n!$ hypersimplices
in the sample space obtained by permuting the $n$ suffixes in (23.8).
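The distribution-free character of (23.8) can be seen in a quick simulation. The sketch below estimates the probability of the ordering $x_1 < x_2 < \ldots < x_n$ for two quite different continuous parents; the function name, the choice $n = 3$, the parent distributions and the seeds are all illustrative assumptions.

```python
import random
from math import factorial


def prob_increasing(sampler, n, trials, seed=0):
    """Monte Carlo estimate of P{x_1 < x_2 < ... < x_n} for i.i.d. draws.

    `sampler` takes a random.Random instance and returns one observation.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [sampler(rng) for _ in range(n)]
        if all(a < b for a, b in zip(xs, xs[1:])):
            hits += 1
    return hits / trials


target = 1.0 / factorial(3)  # 1/n! with n = 3
est_exponential = prob_increasing(lambda rng: rng.expovariate(1.0), 3, 60000, seed=1)
est_normal = prob_increasing(lambda rng: rng.gauss(5.0, 2.0), 3, 60000, seed=2)
```

Both estimates should lie close to $1/3!$ whatever the parent distribution, which is what makes a union of such orderings a similar region of size a multiple of $1/n!$.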


23.6 If we confine ourselves to regions defined by symmetric functions of the
observations (so that similar regions based on (23.8) are excluded) it is easy to see that,
where similar regions do exist, they need not exist for all sample sizes. For example,
for a sample of $n$ observations from the normal distribution
\[ dF(x) = (2\pi)^{-1/2} \exp\{-\tfrac{1}{2}(x-\theta)^2\}\, dx, \]
there is no similar region with respect to $\theta$ for $n = 1$, but for $n \geq 2$ the fact that
$ns^2 = \sum_{i=1}^{n}(x_i-\bar{x})^2$ has a chi-squared distribution with $(n-1)$ degrees of freedom,
whatever the value of $\theta$, ensures that similar regions of any size can be found from the
distribution of $s^2$. This is because $\bar{x}$ is a single sufficient statistic for $\theta$, and to find
a similar region we must, by Exercise 23.3, find a statistic uncorrelated with
\[ \frac{\partial \log L}{\partial \theta} = n(\bar{x}-\theta). \]
This is impossible when $n = 1$, since $\bar{x} = x$ is then the whole sample, but for $n \geq 2$,
$\sum (x-\bar{x})^2$ is distributed independently of $\bar{x}$ and thus gives similar regions. The
same argument holds in Exercise 23.1, where there is a pair of sufficient statistics for
two parameters, and at least three observations are required so that we may have a
statistic independent of both sufficient statistics.
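A simulation sketch of the normal case may be useful. For $n = 3$ the statistic $\sum (x-\bar{x})^2$ is chi-squared with 2 degrees of freedom, whose upper $\alpha$ point is $-2\log\alpha$ in closed form, so no tables are needed. The function name, the choice $\alpha = 0.10$, the seed and the values of $\theta$ tried are illustrative assumptions.

```python
import random
from math import log


def rejection_rate(theta, alpha=0.10, trials=40000, seed=1):
    """Monte Carlo size of the region sum (x - xbar)^2 >= c for n = 3.

    The observations are normal with mean theta and unit variance; c is the
    upper alpha point of chi-squared with 2 d.f., which equals -2 log(alpha).
    """
    crit = -2.0 * log(alpha)
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(theta, 1.0) for _ in range(3)]
        xbar = sum(xs) / 3.0
        if sum((x - xbar) ** 2 for x in xs) >= crit:
            hits += 1
    return hits / trials


# The estimated size should be close to alpha whatever the value of theta.
rates = {theta: rejection_rate(theta) for theta in (-5.0, 0.0, 40.0)}
```

The rejection rate does not move with $\theta$ because the deviations $x_i - \bar{x}$ are unaffected by a shift in the mean, which is the independence property used in the text.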


23.7 Even if $n$ is large, symmetric similar regions will not exist if each observation
brings a new parameter with it, as in the following example, due to Feller (1938).


Example 23.1 
Consider a sample of m observations, where the 7th observation has distribution 
dF (x;) = (20) -# exp {—4(x,—6,)?} dx, 
so that L(x|6) = (20) exp {—3 4 (x;—9;)*}. 
For a similar region w of size «, we require, identically in 6, 


| L@|6)de = a. 


(*) The term arose because, trivially, the entire sample space is a similar region with $\alpha = 1$.
Using (23.3), we may re-write this size condition as
$$\int_W \frac{c(w)}{\alpha}\,L(x|\theta)\,dx = 1, \qquad (23.9)$$
where $W$ is the whole sample space. Differentiating (23.9) with respect to $\theta_i$, we find
$$\int_W \frac{c(w)}{\alpha}\,L(x|\theta)\,(x_i-\theta_i)\,dx = 0. \qquad (23.10)$$
A second differentiation with respect to $\theta_i$ gives
$$\int_W \frac{c(w)}{\alpha}\,L(x|\theta)\,\{(x_i-\theta_i)^2-1\}\,dx = 0. \qquad (23.11)$$
Now from the definition of $c(w)$,
$$g(x|\theta) = \frac{c(w)}{\alpha}\,L(x|\theta) \qquad (23.12)$$
is a (joint) frequency function. (23.10) and (23.11) express the facts that the marginal
distribution of $x_i$ in $g(x|\theta)$ has

$$E(x_i) = \theta_i, \qquad \operatorname{var} x_i = 1,$$
just as it has in the initial distribution of $x_i$.

If we examine the form of $g(x|\theta)$, we see that if we were to proceed with further
differentiations, we should find that all the moments and product-moments of $g(x|\theta)$
are identical with those of $L(x|\theta)$, which is uniquely determined by its moments.
Thus, from (23.12), $c(w)/\alpha = 1$ identically. But since $c(w)$ is either 0 or 1, we see
finally that the trivial values $\alpha = 0$ or 1 are the only values for which similar regions
can exist. The difficulty here is that all $n$ observations are required to form a sufficient
set for the $n$ parameters, and we can find no statistic independent of them all.


23.8 It is nevertheless true that for many problems of testing composite hypotheses,
similar regions exist for any size $\alpha$ and any sample size $n$. We now have to consider
how they are to be found.

Let $t$ be a sufficient statistic for the parameter $\theta_s$ unspecified by the hypothesis
$H_0$, and suppose that we have a critical region $w$ such that for all values of $t$, when
$H_0$ is true,


$$E\{c(w)|t\} = \alpha. \qquad (23.13)$$
Then, on taking expectations with respect to $t$, we have, as at (23.5),
$$E\{c(w)\} = E\{E(c(w)|t)\} = \alpha \qquad (23.14)$$


so that the original critical region $w$ is similar of size $\alpha$, as Neyman (1937b) and Bartlett
(1937) pointed out. Thus $w$ is composed of a fraction $\alpha$ of the probability content
of each contour of constant $t$.

It should be noticed that here $t$ need be sufficient only for the unspecified parameter
$\theta_s$, and only when $H_0$ is true. This should be contrasted with the more demanding
requirements of 23.3.

Our argument has shown that (23.13) is a sufficient condition that $w$ be similar.
We shall show in 23.19 that it is necessary and sufficient, provided that a further
condition is fulfilled, and in order to state that condition we must now introduce, following
Lehmann and Scheffé (1950), the concept of the completeness of a parametric family
of distributions, a concept which also permits us to supplement the discussion of
sufficient statistics in Chapter 17.


Complete parametric families and complete statistics 

23.9 Consider a parametric family of (univariate or multivariate) distributions,
$f(x|\theta)$, depending on the value of a vector of parameters $\theta$. Let $h(x)$ be any statistic,
independent of $\theta$. If for all $\theta$

$$E\{h(x)\} = \int h(x) f(x|\theta)\,dx = 0 \qquad (23.15)$$
implies that
$$h(x) = 0 \qquad (23.16)$$
identically (save possibly on a zero-measure set), then the family $f(x|\theta)$ is called
complete. If (23.15) implies (23.16) only for all bounded $h(x)$, $f(x|\theta)$ is called boundedly
complete.

In the statistical applications of the concept of completeness, the family of
distributions we are interested in is often the sampling distribution of a (possibly vector-)
statistic $t$, say $g(t|\theta)$. We then call $t$ a complete (or boundedly complete) statistic if,
for all $\theta$, $E\{h(t)\} = 0$ implies $h(t) = 0$ identically, for all functions (or bounded
functions) $h(t)$. In other words, we label the statistic $t$ with the completeness property
of its distribution.

An evident immediate consequence of the completeness of a statistic $t$ is that only
one function of that statistic can have a given expected value. Thus if one function
of $t$ is an unbiassed estimator of a certain function of $\theta$, no other function of $t$ will be.
Completeness confers a uniqueness property upon an estimator.

J. K. Ghosh and Singh (1966) use a theorem by Wiener to show that if $\theta$ is a location
parameter (i.e., $f = f(x-\theta)$), bounded completeness is equivalent to the c.f. $\phi(t)$ being
non-zero for all $t$. Thus, e.g., the Cauchy distribution of Example 17.7 is boundedly
complete; the c.f. is given in Example 4.2, Vol. 1.


The completeness of sufficient statistics 
23.10 The special case of the exponential family (17.83) with $A(\theta) = \theta$, $B(x) = x$
has

$$f(x|\theta) = \exp\{\theta x + C(x) + D(\theta)\}, \qquad -\infty \le x \le \infty. \qquad (23.17)$$
If, for all $\theta$,
$$\int h(x) f(x|\theta)\,dx = 0,$$
we must have
$$\int [h(x)\exp\{C(x)\}]\exp(\theta x)\,dx = 0. \qquad (23.18)$$

The integral in (23.18) is the two-sided Laplace transform (*) of the function in
square brackets in the integrand. By the uniqueness property of the transform, the
only function having a transform of zero value is zero itself; i.e.,
$$h(x)\exp\{C(x)\} = 0$$
identically, whence
$$h(x) = 0$$
identically. Thus $f(x|\theta)$ is complete.

(*) The two-sided Laplace transform of a function $g(x)$ is defined by
$$\lambda(\theta) = \int_{-\infty}^{\infty} \exp(\theta x)\, g(x)\,dx.$$
The integral converges in a strip of the complex plane $\alpha < R(\theta) < \beta$, where one or both of
$\alpha$, $\beta$ may be infinite. (The strip may degenerate to a line.) Except possibly for a zero-measure
set, there is a one-to-one correspondence between $g(x)$ and $\lambda(\theta)$. See, e.g., D. V. Widder (1941),
The Laplace Transform, Princeton U.P., and compare also the Inversion Theorem for c.f.'s in 4.3.

This result generalizes to the multi-parameter case, as shown by Lehmann and
Scheffé (1955): the k-parameter, k-variate exponential family

$$f(\mathbf{x}|\boldsymbol{\theta}) = \exp\Big\{\sum_{j=1}^{k}\theta_j x_j + C(\mathbf{x}) + D(\boldsymbol{\theta})\Big\} \qquad (23.19)$$

is a complete family. We have seen (Exercise 17.14) that the joint distribution of the
set of k sufficient statistics for the k parameters of the general univariate exponential
form (17.86) takes a form of which (23.19) is the special case, with $A_j(\theta) = \theta_j$. (We
have replaced nD and OQ of the Exercise by D and exp(C) respectively.) By 23.3, we 
may confine ourselves, in testing hypotheses about the parent parameters, to the 
sufficient statistics. 


Example 23.2 
Consider the family of normal distributions

$$f(x|\theta_1,\theta_2) = (2\pi\theta_2)^{-1/2}\exp\Big\{-\frac{1}{2\theta_2}(x-\theta_1)^2\Big\}, \qquad -\infty \le x \le \infty;\ \theta_2 > 0.$$

(a) If $\theta_2$ is known (say = 1), the family is complete with respect to $\theta_1$, for we are
then considering a special case of (23.17) with
$$\theta = \theta_1, \qquad \exp\{C(x)\} = (2\pi)^{-1/2}\exp(-\tfrac{1}{2}x^2)$$
and
$$\exp\{D(\theta)\} = \exp(-\tfrac{1}{2}\theta_1^2).$$
(b) If, on the other hand, $\theta_1$ is known (say = 0), the family is not even boundedly
complete with respect to $\theta_2$, for $f(x|0,\theta_2)$ is an even function of $x$, so that any
bounded odd function $h(x)$ will have zero expectation without being identically zero.
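Part (b) of Example 23.2 can be checked numerically. The sketch below is an illustration under our own assumptions: the choice $h(x) = \tanh x$, the integration grid and the function names are not from the text. It approximates $E\{h(x)\}$ under $N(0, \theta_2)$ by the trapezoidal rule on a symmetric grid, for several values of $\theta_2$.

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def expectation(h, var, half_width=12.0, steps=20000):
    """Trapezoidal approximation to E{h(x)} under N(0, var).

    The grid is symmetric about zero and extends half_width standard
    deviations each way (an arbitrary choice for this sketch).
    """
    lim = half_width * math.sqrt(var)
    dx = 2.0 * lim / steps
    total = 0.0
    for i in range(steps + 1):
        x = -lim + i * dx
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * h(x) * normal_pdf(x, var)
    return total * dx

for var in (0.25, 1.0, 9.0):
    # tanh is bounded and odd: its expectation vanishes for every variance.
    # x*x is even: its expectation is the variance, a sanity check on the grid.
    print(var, expectation(math.tanh, var), expectation(lambda x: x * x, var))
```

The first printed expectation is zero to rounding error whatever the variance, although $\tanh x$ is not identically zero; this is exactly the failure of bounded completeness described in (b).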


23.11 In 23.10 we discussed the completeness of the characteristic form of the
joint distribution of sufficient statistics in samples from a parent distribution with range
independent of the parameters. Hogg and Craig (1956) have established the
completeness of the sufficient statistic for parent distributions whose range is a function of a
single parameter $\theta$ and which possess a single sufficient statistic for $\theta$. We recall from
17.40-1 that the parent must then be of form

$$f(x|\theta) = g(x)/h(\theta) \qquad (23.20)$$
and that

(i) if a single terminal of $f(x|\theta)$ is a function of $\theta$ (which may be taken to be $\theta$
itself without loss of generality), the corresponding extreme order-statistic is
sufficient;
(ii) if both terminals are functions of $\theta$, the upper terminal ($b(\theta)$) must be a
monotone decreasing function of the lower terminal ($\theta$) for a single sufficient statistic
to exist, and that statistic is then

$$\min\{x_{(1)},\ b^{-1}(x_{(n)})\}. \qquad (23.21)$$
We consider the cases (i) and (ii) in turn.


23.12 In case (i), take the upper terminal equal to $\theta$, the lower equal to a constant
$a$. $x_{(n)}$ is then sufficient for $\theta$. Its distribution is, from (11.34) and (23.20),

$$dG(x_{(n)}) = n\{F(x_{(n)})\}^{n-1} f(x_{(n)})\,dx_{(n)}
= \frac{n\{\int_a^{x_{(n)}} g(x)\,dx\}^{n-1}\, g(x_{(n)})}{\{h(\theta)\}^{n}}\,dx_{(n)}, \qquad a \le x_{(n)} \le \theta. \qquad (23.22)$$
Now suppose that for a statistic $u(x_{(n)})$ we have

$$\int_a^{\theta} u(x_{(n)})\,dG(x_{(n)}) = 0,$$

or, substituting from (23.22), and dropping the factor in $h(\theta)$,

$$\int_a^{\theta} u(x_{(n)})\Big\{\int_a^{x_{(n)}} g(x)\,dx\Big\}^{n-1} g(x_{(n)})\,dx_{(n)} = 0. \qquad (23.23)$$
If we differentiate (23.23) with respect to $\theta$, we find
$$u(\theta)\Big\{\int_a^{\theta} g(x)\,dx\Big\}^{n-1} g(\theta) = 0, \qquad (23.24)$$
and since the integral in braces equals $h(\theta)$, while $g(\theta) \ne 0 \ne h(\theta)$, (23.24) implies
$$u(\theta) = 0$$

for any $\theta$. Hence the function $u(x_{(n)})$ is identically zero, and the distribution of $x_{(n)}$,
given at (23.22), is complete. Exactly the same argument holds for the lower terminal
and $x_{(1)}$.


23.13 In case (ii), the distribution function $G(t)$ of the sufficient statistic (23.21) satisfies
$$1 - G(t) = P\{x_{(1)},\ b^{-1}(x_{(n)}) \ge t\}
= P\{x_{(1)} \ge t,\ x_{(n)} \le b(t)\}
= \frac{\{\int_t^{b(t)} g(x)\,dx\}^{n}}{\{h(\theta)\}^{n}}, \qquad (23.25)$$
the second form following because $b$ is monotone decreasing.

Differentiating (23.25) with respect to $t$, we obtain the frequency function of the
sufficient statistic,

$$g(t) = \frac{n}{\{h(\theta)\}^{n}}\Big\{\int_t^{b(t)} g(x)\,dx\Big\}^{n-1}\{g(t) - g(b(t))\,b'(t)\}, \qquad \theta \le t \le c(\theta). \qquad (23.26)$$
If there is a statistic $u(t)$ for which

$$\int_{\theta}^{c(\theta)} u(t)\,g(t)\,dt = 0, \qquad (23.27)$$
we find, on differentiating (23.27) with respect to $\theta$ and by following through the
argument of 23.12, that $u(\theta) = 0$ for any $\theta$, as before. Thus $u(t) = 0$ identically
and $g(t)$ at (23.26) is complete.


23.14 The following example is of a non-complete single sufficient statistic. 


Example 23.3 
Consider a sample of a single observation $x$ from the rectangular distribution
$$dF = dx, \qquad \theta \le x \le \theta + 1.$$
$x$ is evidently a sufficient statistic. (There would be no single sufficient statistic
for $n \ge 2$, since the condition (ii) of 23.11 is not satisfied.)
Any bounded periodic function $h(x)$ of period 1 which satisfies

$$\int_0^1 h(x)\,dx = 0$$
will give us
$$\int_{\theta}^{\theta+1} h(x)\,dF = \int_{\theta}^{\theta+1} h(x)\,dx = \int_0^1 h(x)\,dx = 0,$$

so that the distribution is not even boundedly complete, since $h(x)$ is not identically
zero.
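A concrete instance of Example 23.3 may help. The sketch below assumes the particular choice $h(x) = \sin 2\pi x$ (ours, not the text's) and a simple midpoint rule; the function names are hypothetical. It evaluates the integral of $h$ over $(\theta, \theta+1)$ for several values of $\theta$.

```python
import math

def integral_over_unit_interval(h, theta, steps=10000):
    """Midpoint-rule approximation to the integral of h over (theta, theta + 1)."""
    dx = 1.0 / steps
    return sum(h(theta + (i + 0.5) * dx) for i in range(steps)) * dx

def h(x):
    """A bounded function of period 1 whose integral over (0, 1) is zero."""
    return math.sin(2.0 * math.pi * x)

for theta in (0.0, 0.37, -2.5, 11.2):
    # Each value is E{h(x)} for the rectangular distribution on (theta, theta + 1).
    print(theta, integral_over_unit_interval(h, theta))
```

Every printed value is zero to rounding error, while $h$ itself takes values as large as 1, so $E\{h(x)\} = 0$ for all $\theta$ does not force $h(x) = 0$.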


Minimal sufficiency 

23.15 We recall from 17.38 that, when we consider the problem of sufficient
statistics in general (i.e. without restricting ourselves, as we did earlier in Chapter 17,
to the case of a single sufficient statistic), we have to consider the choice between
alternative sets of sufficient statistics. In a sample of $n$ observations we always have a
set of $n$ sufficient statistics (namely, the observations themselves) for the $k$ ($\ge 1$)
parameters of the distribution we are sampling from. Sometimes, though not always, there
will be a set of $s$ ($< n$) statistics sufficient for the parameters. Often, $s = k$; e.g. all
the cases of sufficiency discussed in Examples 17.15-16 have $s = k = 1$, while in
Example 17.17 we have $s = k = 2$. By contrast, the following is an example in
which $s < k$.(*)


Example 23.4 
Consider again the problem of Example 22.11, with the alteration that
$$E(x_1) = n\mu$$
instead of $n\theta$ as previously. Exactly as before, we find for the joint distribution

$$dF = \frac{1}{(2\pi)^{n/2}\,\theta}\exp\Big[-\tfrac{1}{2}\Big\{\frac{n^2}{\theta^2}(\bar{x}-\mu)^2 + \sum_{i=2}^{n} x_i^2\Big\}\Big]\,dx_1 \cdots dx_n.$$

Here it is clear that the single statistic $\bar{x}$ is sufficient for the parameters $\mu$, $\theta$.


(*) Fisher (e.g., 1956) called a sufficient set of statistics "sufficient" only if $s = k$ and
"exhaustive" if $s > k$.
23.16 We thus have to ask ourselves: what is the smallest number $s$ of statistics
which constitute a sufficient set in any problem? With this in mind, Lehmann and
Scheffé (1950) define a vector of statistics as minimal sufficient if it is a function of all
other vectors of statistics which are sufficient for the parameters of the distribution.
The problems which now raise themselves are: how can we be sure, in any particular
situation, that a sufficient vector is the minimal sufficient vector? And can we find a
construction which yields the minimal sufficient vector?

A partial answer to the first of these questions is supplied by the following result:
if the vector $t$ is a boundedly complete sufficient statistic for $\theta$, and the vector $u$ is a
minimal sufficient statistic for $\theta$, then $t$ is equivalent to $u$, i.e. they are identical, except
possibly for a zero-measure set.

The proof is simple. Let $w$ be a region in the sample space for which

$$D = E(c(w)|t) - E(c(w)|u) \ne 0, \qquad (23.28)$$

where the function $c(w)$ is defined at (23.3). From (23.28), we find, on taking
expectations over the entire sample space,

$$E(D) = 0. \qquad (23.29)$$
Now since $u$ is minimal sufficient, it is a function of $t$, another sufficient statistic, by
definition. Hence we may write (23.28)

$$D = h(t) \ne 0. \qquad (23.30)$$

Since $D$ is a bounded function, (23.29) and (23.30) contradict the assumed bounded
completeness of $t$, and thus there can be no region $w$ for which (23.28) holds. Hence
$t$ and $u$ are equivalent statistics, i.e. $t$ is minimal sufficient.

The converse does not hold: while bounded completeness implies minimal
sufficiency, we can have minimal sufficiency without bounded completeness. An
important instance is discussed in Example 23.10 below.

A consequence of the result of this section is that there cannot be more than one
boundedly complete sufficient statistic for a parameter. Example 23.2(b) has $x^2$
minimal sufficient and complete, where $x$ is sufficient and not complete.

An alternative formulation of the problem of minimal sufficiency is given by 
Dynkin (1951). 


23.17 In view of the results of 23.10-13 concerning the completeness of sufficient 
statistics, a consequence of 23.16 is that all the examples of sufficient statistics we have 
discussed in earlier chapters are minimal sufficient, as one would expect on intuitive 
grounds. 


23.18 The result of section 23.16, though useful, is less direct than the follow- 
ing procedure for finding a minimal sufficient statistic, given by Lehmann and 
Scheffé (1950). 

(*) That this is for practical purposes equivalent to a sufficient statistic with minimum number
of components is shown by Barankin and Katz (1959). See also Barankin (1960a, 1960b, 1961)
and Fraser (1963).

We have seen in 22.14 and 22.20 that in testing a simple hypothesis, the ratio of
likelihoods is a function of the sufficient (set of) statistic(s). We may now, so to speak,
put this result into reverse, and use it to find the minimal sufficient set. Writing
$L(x|\theta)$ for the LF as before, where $x$ and $\theta$ may be vectors, consider a particular set
of values $x_0$ and select all those values of $x$ within the permissible range for which
$L(x|\theta)$ is non-zero and
$$\frac{L(x|\theta)}{L(x_0|\theta)} = k(x, x_0) \qquad (23.31)$$
is independent of $\theta$. Now any sufficient statistic $t$ (possibly a vector) will satisfy
(17.68), whence

$$\frac{L(x|\theta)}{L(x_0|\theta)} = \frac{g(t|\theta)\,h(x)}{g(t_0|\theta)\,h(x_0)}, \qquad (23.32)$$
so that if $t = t_0$ (23.32) reduces to the form (23.31). Conversely, if (23.31) holds for
all $\theta$, this implies the constancy of the sufficient statistic $t$ at the value $t_0$. This may
be used to identify sufficient statistics, and to select the minimal set, in the manner
of the following examples.


Example 23.5 

We saw in Example 17.17 that the set of statistics $(\bar{x}, s^2)$ is jointly sufficient for the
parameters $(\mu, \sigma^2)$ of a normal distribution. For this distribution, $L(x|\theta)$ is non-zero
for all $\sigma^2 > 0$, and the condition (23.31) is, on taking logarithms, that

$$-\frac{1}{2\sigma^2}\Big\{\Big(\sum_i x_i^2 - \sum_i x_{0i}^2\Big) - 2\mu n(\bar{x}-\bar{x}_0)\Big\} \qquad (23.33)$$

be independent of $(\mu, \sigma^2)$, i.e. that the term in braces be equal to zero. This will be
so, for example, if every $x_i$ is equal to the corresponding $x_{0i}$, confirming that the set of
$n$ observations is a jointly sufficient set, as we have remarked that they always are.
It will also be so if the $x_i$ are any rearrangement (permutation) of the $x_{0i}$: thus
the set of order-statistics is sufficient, as it is again obvious that they always are. Finally,
the condition in (23.33) will be satisfied if
$$\bar{x} = \bar{x}_0, \qquad \sum_i x_i^2 = \sum_i x_{0i}^2 \qquad (23.34)$$

and clearly, from inspection, nothing less than this will do. Thus the pair $(\bar{x}, \sum x^2)$ is
minimal sufficient: equivalently, since $ns^2 = \sum x^2 - n\bar{x}^2$, $(\bar{x}, s^2)$ is minimal sufficient.
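The condition (23.31) for the normal case can be tried on numbers. In the following sketch the three samples and the function names are our own assumptions: `x_same` shares both $\bar{x}$ and $\sum x^2$ with `x0` without being a permutation of it, while `x_mean_only` shares only $\bar{x}$. We evaluate the logarithm of the likelihood ratio at several parameter values.

```python
import math

def normal_loglik(xs, mu, sigma2):
    """Log-likelihood of an independent N(mu, sigma2) sample."""
    n = len(xs)
    return (-0.5 * n * math.log(2.0 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2.0 * sigma2))

def log_ratio(xs, x0s, mu, sigma2):
    """log{L(x|theta) / L(x0|theta)}, the logarithm of the ratio in (23.31)."""
    return normal_loglik(xs, mu, sigma2) - normal_loglik(x0s, mu, sigma2)

x0 = [1.0, 2.0, 6.0]            # sum 9, sum of squares 41
x_same = [0.0, 4.0, 5.0]        # sum 9, sum of squares 41, not a permutation of x0
x_mean_only = [3.0, 3.0, 3.0]   # sum 9, but sum of squares 27

params = [(0.0, 1.0), (2.5, 0.5), (-4.0, 7.0)]   # hypothetical (mu, sigma^2) values

# Same (xbar, sum x^2): the log-ratio is zero at every parameter value.
print([log_ratio(x_same, x0, mu, s2) for mu, s2 in params])

# Same xbar only: the log-ratio is 7 / sigma^2, which changes with sigma^2.
print([log_ratio(x_mean_only, x0, mu, s2) for mu, s2 in params])
```

The second list varies with $\sigma^2$, as (23.33) predicts: equality of the means alone leaves the term in braces equal to $27 - 41 = -14$, so $\bar{x}$ by itself is not enough.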


Example 23.6 
As a contrast, consider the Cauchy distribution of Example 17.7. $L(x|\theta)$ is
everywhere non-zero and (23.31) requires that

$$\prod_{i=1}^{n}\{1+(x_{0i}-\theta)^2\} \Big/ \prod_{i=1}^{n}\{1+(x_i-\theta)^2\} \qquad (23.35)$$

be independent of $\theta$. As in the previous example, the set of order-statistics is sufficient,
but nothing less will do here, for (23.35) is the ratio of two polynomials, each of degree
$2n$ in $\theta$. If the ratio is to be independent of $\theta$, each polynomial must have the same
set of roots, possibly permuted. Thus we are thrown back on the order-statistics as
the minimal sufficient set.
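The contrast with the normal case shows up at once in a numerical check of (23.35). The sketch below uses hypothetical samples and function names of our own: a permutation of `x0` gives a ratio free of $\theta$, whereas a sample that merely shares the mean and sum of squares of `x0` does not.

```python
import math

def cauchy_loglik(xs, theta):
    """Log-likelihood of an independent Cauchy sample with location theta."""
    return -sum(math.log(math.pi * (1.0 + (x - theta) ** 2)) for x in xs)

def log_ratio(xs, x0s, theta):
    """Logarithm of the ratio (23.35), i.e. log{L(x|theta) / L(x0|theta)}."""
    return cauchy_loglik(xs, theta) - cauchy_loglik(x0s, theta)

x0 = [1.0, 2.0, 6.0]
x_perm = [6.0, 1.0, 2.0]            # a permutation of x0
x_same_moments = [0.0, 4.0, 5.0]    # same mean and sum of squares as x0

thetas = [-3.0, 0.0, 1.5, 10.0]     # hypothetical parameter values

perm_ratios = [log_ratio(x_perm, x0, th) for th in thetas]
moment_ratios = [log_ratio(x_same_moments, x0, th) for th in thetas]

print(perm_ratios)     # all zero up to rounding: the order-statistics suffice
print(moment_ratios)   # these change with theta: two moments do not suffice
```

The same pair of samples gave a constant ratio under the normal model, which is the sense in which the Cauchy family admits no reduction beyond the order-statistics.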
Completeness and similar regions

23.19 After our lengthy excursus on completeness, we return to the discussion
of similar regions in 23.8. We may now show that if, given $H_0$, the sufficient statistic $t$
is boundedly complete, all size-$\alpha$ similar regions must satisfy (23.13). For any such
region, (23.14) holds and may be re-written

$$E\{E(c(w)|t) - \alpha\} = 0. \qquad (23.36)$$

The expression in braces in (23.36) is bounded. Thus if $t$ is boundedly complete,
(23.36) implies that $E(c(w)|t) - \alpha = 0$ identically, i.e. that (23.13) holds.

The converse result also holds: if all similar regions satisfy (23.13), then Lehmann
and Scheffé (1950) proved that $t$ must be boundedly complete. Thus the bounded
completeness of a sufficient statistic is equivalent to the condition that all similar
regions $w$ satisfy (23.13).


The choice of most powerful similar regions 


23.20 The importance of the result of 23.19 is that it permits us to reduce the
problem of finding most powerful similar regions for a composite hypothesis to the
familiar problem of finding a BCR for a simple hypothesis.

By 23.19, the bounded completeness of the statistic $t$, sufficient for $\theta_s$ on $H_0$, implies
that all similar regions $w$ satisfy (23.13), i.e. every similar region is composed of a
fraction $\alpha$ of the probability content of each contour of constant $t$. We therefore may
conduct our discussion with $t$ held constant. Constancy of the sufficient statistic, $t$,
for $\theta_s$ implies from (17.68) that the conditional distribution of the observations in the
sample space will be independent of $\theta_s$. Thus the composite $H_0$ with $\theta_s$ unspecified
is reduced to a simple $H_0$ with $t$ held constant. If $t$ is also sufficient for $\theta_s$ when $H_1$
holds, the composite $H_1$ is also reduced to a simple $H_1$ with $t$ constant (and,
incidentally, the power of any critical region with $t$ constant, as well as its size, will be
independent of $\theta_s$). If, however, $t$ is not sufficient for $\theta_s$ when $H_1$ holds, we consider
$H_1$ as a class of simple alternatives to the simple $H_0$, in just the manner of the previous
chapter.

Thus, by keeping $t$ constant, we reduce the problem to that of testing a simple
$H_0$ concerning $\theta_r$ against a simple $H_1$ (or a class of simple alternatives constituting $H_1$).
We use the methods of the last chapter, based on the Neyman-Pearson lemma (22.6),
to seek a BCR (or common BCR) for $H_0$ against $H_1$. If there is such a BCR for each
fixed value of $t$, it will evidently be an unconditional BCR, and gives the most powerful
similar test of $H_0$ against $H_1$. Just as previously, if this test remains most powerful
against a class of alternative values of $\theta_r$, it is a UMP similar test.


Example 23.7 
To test Hy: u = uy against H,:u = m, for the normal distribution 


ote gpe BE |: 18 
dF = pe { (=A) bar OLN BS: 


H, and H, are composite with one degree of freedom, o? being unspecified. 
From Examples 17.10 and 17.15, the statistic (calculated from a sample of n inde- 
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n 
pendent observations) u = & (x; — Mo)? is sufficient for o? when Hy, holds, but not 
i=1 


otherwise. From 23.10, u is a complete statistic. All similar regions for H, therefore 
consist of fractions « of each contour of constant u. 
Holding u fixed, we now test 


Hy: u =m, against H,: 4 = wy, 6 = 0, 
both hypotheses being simple. The BCR obtained from (22.6) is that for which 
L(x | Ho) < k ; 
L(x|Hy)  * 
This reduces, on simplification, to the condition 


% (U1— Ho) 2 C'(Mo, M1) o*, O15 Ryy u) (23.37) 
where C is a constant containing no function of x except u. Thus the BCR consists 
of large values of % if 4,—j4) > 0 and of small values of % if “4,—j9 < 0, and this is 
true whatever the values of o? and o?, and whatever the magnitude of |u“,—j)|. Thus 
we have a common BCR for the class of alternatives H,: u = m, for each one-sided 
situation “, > Mo and my < [o. 

We have been holding u fixed. Now 


w= E (eye)? = E(e— a+ m(@— 5) (23.38) 
Z Bea {1 +See |. (23.39) 


Since the BCR for fixed u consists of extreme values of #, (23.38) implies that the 
BCR consists of small values of =(x—)?, which by (23.39) implies large values of 


P _ n(%—Mo)” 

i ee (23.40) 
t? as defined by (23.40) is the square of the “ Student’s ” ¢ statistic whose distribution 
was derived in Example 11.8. By Exercise 23.7, ¢, which is distributed free of 0°, is 
consequently distributed independently of the complete sufficient statistic, u, for o”. 
Remembering the necessary sign of #, we have finally that the unconditional UMP 
similar test of H, against H, is to reject the largest or smallest 100« per cent of the 
distribution of ¢ according to whether uw, > Mo OF My < Lo. 

As we have seen, the distribution of ¢ does not depend on o?. The power of the 
UMP similar test, however, does depend on o?, for u is not sufficient for o? when Hy 
does not hold. Since every similar region for H, consists of fractions « of each con- 
tour of constant uw, and the distribution on any such contour is a function of o? when 
H, holds, there can be no similar region for Hy with power independent of o?, a result 


first established by Dantzig (1940). 
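Two elementary facts behind Example 23.7 can be confirmed arithmetically. The sketch below rests on our own assumptions (hypothetical data, hypothetical function names, and $\mu_0 = 5$): it checks the decomposition (23.38)-(23.39) of $u$, and shows that stretching the deviations from $\mu_0$ by a constant factor, which is the effect of changing $\sigma$ under $H_0$, multiplies $u$ by the square of that factor but leaves $t$ untouched.

```python
import math

def student_t(xs, mu0):
    """Student's t for H0: mu = mu0, written in the form implied by (23.40)."""
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    return math.sqrt(n * (n - 1)) * (xbar - mu0) / math.sqrt(ss)

def sufficient_u(xs, mu0):
    """u = sum (x_i - mu0)^2, sufficient for sigma^2 when H0 holds."""
    return sum((x - mu0) ** 2 for x in xs)

mu0 = 5.0
sample = [4.1, 5.3, 3.8, 6.0, 5.1]                  # hypothetical observations
rescaled = [mu0 + 3.0 * (x - mu0) for x in sample]  # deviations from mu0 tripled

n = len(sample)
xbar = sum(sample) / n
ss = sum((x - xbar) ** 2 for x in sample)

# (23.39): u = ss * {1 + t^2 / (n - 1)}
print(sufficient_u(sample, mu0), ss * (1.0 + student_t(sample, mu0) ** 2 / (n - 1)))

# t is unchanged by the rescaling, while u is multiplied by 9.
print(student_t(sample, mu0), student_t(rescaled, mu0))
print(sufficient_u(rescaled, mu0) / sufficient_u(sample, mu0))
```

Scale invariance of this kind is what makes the distribution of $t$ free of $\sigma^2$ under $H_0$; it says nothing about the power, which the text shows must depend on $\sigma^2$.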
Example 23.8 


For two normal distributions with means $\mu$, $\mu+\theta$ and common variance $\sigma^2$, to test
$$H_0: \theta = \theta_0 \ (= 0, \text{ without loss of generality})$$
against
$$H_1: \theta = \theta_1,$$
on the basis of independent samples of sizes $n_1$, $n_2$ with means $\bar{x}_1$, $\bar{x}_2$.
Write

$$n = n_1 + n_2, \qquad n\bar{x} = n_1\bar{x}_1 + n_2\bar{x}_2, \qquad
s^2 = \sum_{i=1}^{2}\sum_{j=1}^{n_i}(x_{ij}-\bar{x})^2 = \sum_{i}\sum_{j} x_{ij}^2 - n\bar{x}^2. \qquad (23.41)$$

The hypotheses are composite with two degrees of freedom. When $H_0$ holds, but
not otherwise, the pair of statistics $(\bar{x}, s^2)$ is sufficient for the unspecified parameters
$(\mu, \sigma^2)$, and it follows from 23.10 that $(\bar{x}, s^2)$ is complete. Thus all similar regions
for $H_0$ satisfy (23.13), and we hold $(\bar{x}, s^2)$ fixed, and test the simple

$$H_0: \theta = 0$$
against
$$H_1': \theta = \theta_1,\ \mu = \mu_1,\ \sigma = \sigma_1.$$

Our original $H_1$ consists of the class of $H_1'$ for all $\mu_1$, $\sigma_1$.

The BCR obtained from (22.6) reduces, on simplification, to

$$\bar{x}_1\theta_1 \le g_\alpha$$

where $g_\alpha$ is a constant function of all the parameters, and of $\bar{x}$ and $s^2$, but not otherwise
of the observations. For fixed $\bar{x}$, $s^2$, the BCR is therefore characterized by extreme
values of $\bar{x}_1$ of opposite sign to $\theta_1$, and this is true whatever the values of the other
parameters. (23.41) then implies that for each fixed $(\bar{x}, s^2)$, the BCR will consist of
large values of $(\bar{x}_1-\bar{x}_2)^2/s^2$, and hence of the equivalent monotone increasing function

$$\frac{(\bar{x}_1-\bar{x}_2)^2}{\sum_j (x_{1j}-\bar{x}_1)^2 + \sum_j (x_{2j}-\bar{x}_2)^2} = \frac{t^2}{n-2}\cdot\frac{n}{n_1 n_2}. \qquad (23.42)$$
(23.42) is the definition of the usual "Student's" $t^2$ statistic for this problem, which
we have encountered as an interval estimation problem in 21.12. By Exercise 23.7,
$t^2$, which is distributed free of $\mu$ and $\sigma^2$, is distributed independently of the complete
sufficient statistic $(\bar{x}, s^2)$ for $(\mu, \sigma^2)$. Thus, unconditionally, the UMP similar test of
$H_0$ against $H_1$ is given by rejecting the $100\alpha$ per cent largest or smallest values in the
distribution of $t$, according to whether $\theta_1$ (or, more generally, $\theta_1-\theta_0$) is positive or
negative.

Here, as in the previous example, the power of the BCR depends on $(\mu, \sigma^2)$, since
$(\bar{x}, s^2)$ is not sufficient when $H_0$ does not hold.
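The step from $(\bar{x}_1-\bar{x}_2)^2/s^2$ to (23.42) rests on the identity $s^2 = \sum_j (x_{1j}-\bar{x}_1)^2 + \sum_j (x_{2j}-\bar{x}_2)^2 + (n_1 n_2/n)(\bar{x}_1-\bar{x}_2)^2$, which is not written out above. The following sketch (hypothetical data and names, and the usual pooled-variance form of $t^2$ taken as an assumption) verifies the identity and (23.42) numerically.

```python
def two_sample_summary(x1, x2):
    """Return the quantities entering (23.41) and (23.42) for two samples."""
    n1, n2 = len(x1), len(x2)
    n = n1 + n2
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    xbar = (n1 * m1 + n2 * m2) / n
    s2 = sum((x - xbar) ** 2 for x in x1 + x2)          # s^2 of (23.41)
    within = (sum((x - m1) ** 2 for x in x1)
              + sum((x - m2) ** 2 for x in x2))         # denominator of (23.42)
    diff2 = (m1 - m2) ** 2
    pooled = within / (n - 2)                           # usual pooled variance estimate
    t2 = diff2 / (pooled * (1.0 / n1 + 1.0 / n2))       # usual two-sample t^2
    return {"n1": n1, "n2": n2, "n": n, "s2": s2,
            "within": within, "diff2": diff2, "t2": t2}

r = two_sample_summary([4.1, 5.3, 3.8, 6.0], [6.2, 7.1, 5.9, 6.8, 7.4])
between = r["n1"] * r["n2"] / r["n"] * r["diff2"]

# s^2 splits into the within-sample and between-sample parts.
print(r["s2"], r["within"] + between)

# Both sides of (23.42).
print(r["diff2"] / r["within"],
      r["t2"] * r["n"] / ((r["n"] - 2) * r["n1"] * r["n2"]))
```

Because $s^2$ equals the within-sample sum of squares plus a positive multiple of $(\bar{x}_1-\bar{x}_2)^2$, the ratio $(\bar{x}_1-\bar{x}_2)^2/s^2$ increases with the left-hand side of (23.42), which is the monotonicity the text invokes.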


Example 23.9 
To test the composite $H_0: \sigma = \sigma_0$ against $H_1: \sigma = \sigma_1$ for the distribution

$$dF = \exp\Big\{-\Big(\frac{x-\theta}{\sigma}\Big)\Big\}\,dx/\sigma, \qquad \theta \le x \le \infty.$$
We have seen (Example 17.19) that $x_{(1)}$, the smallest of a sample of $n$ independent
observations, is sufficient for the unspecified parameter $\theta$, whether $H_0$ or $H_1$ holds.
By 23.12 it is also complete. Thus all similar regions consist of fractions $\alpha$ of each
contour of constant $x_{(1)}$.
The comprehensive sufficiency of $x_{(1)}$ renders both $H_0$ and $H_1$ simple when $x_{(1)}$ is
fixed. The BCR obtained from (22.6) consists of points satisfying

$$\sum_{i=1}^{n} x_i\Big(\frac{1}{\sigma_1}-\frac{1}{\sigma_0}\Big) \le g_\alpha,$$
where $g_\alpha$ is a constant, a function of $\sigma_0$, $\sigma_1$. For each fixed $x_{(1)}$, we therefore have the
BCR defined by

$$\sum x_i \ \begin{cases} \le a_\alpha & \text{if } \sigma_1 < \sigma_0, \\ \ge b_\alpha & \text{if } \sigma_1 > \sigma_0. \end{cases} \qquad (23.43)$$

The statistic in (23.43), $\sum x_i$, is not distributed independently of $x_{(1)}$. To put (23.43)
in a form of more practical value, we observe that the statistic

$$z = \sum_{i=1}^{n}(x_i - x_{(1)})$$

is distributed independently of $x_{(1)}$. (This is a consequence of the completeness and
sufficiency of $x_{(1)}$; see Exercise 23.7 below.) Thus if we rewrite (23.43) for fixed $x_{(1)}$ as

$$z \ \begin{cases} \le c_\alpha & \text{if } \sigma_1 < \sigma_0, \\ \ge d_\alpha & \text{if } \sigma_1 > \sigma_0, \end{cases} \qquad (23.44)$$
where $c_\alpha = a_\alpha - nx_{(1)}$, $d_\alpha = b_\alpha - nx_{(1)}$, we have on the left of (23.44) a statistic which
for every fixed $x_{(1)}$ determines the BCR by its extreme values and whose distribution
does not depend on $x_{(1)}$. Thus (23.44) gives an unconditional BCR for each of the
one-sided situations $\sigma_1 < \sigma_0$, $\sigma_1 > \sigma_0$, and we have the usual pair of UMP tests.

Note that in this example, the comprehensive sufficiency of $x_{(1)}$ makes the power
of the UMP tests independent of $\theta$ (which is only a location parameter).
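A small simulation gives some feeling for why (23.43) is replaced by (23.44). The sketch below is an illustration only: the sample size, parameter values, number of replications and seed are arbitrary assumptions, and a sample correlation near zero is merely a necessary symptom of independence, not a proof of it. It estimates the correlation of $x_{(1)}$ with $\sum x_i$ and with $z = \sum (x_i - x_{(1)})$.

```python
import math
import random

def correlation(a, b):
    """Sample product-moment correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def simulate(n=5, theta=2.0, sigma=1.5, reps=20000, seed=12345):
    """Return (corr of sum x with x_(1), corr of z with x_(1)) over simulated samples."""
    rng = random.Random(seed)
    mins, sums, zs = [], [], []
    for _ in range(reps):
        # theta + exponential with mean sigma has density exp{-(x-theta)/sigma}/sigma
        xs = [theta + rng.expovariate(1.0 / sigma) for _ in range(n)]
        x1 = min(xs)
        mins.append(x1)
        sums.append(sum(xs))
        zs.append(sum(x - x1 for x in xs))
    return correlation(sums, mins), correlation(zs, mins)

corr_sum, corr_z = simulate()
print(corr_sum)   # clearly positive (the theoretical value is 1/sqrt(n))
print(corr_z)     # close to zero
```

The theoretical correlation between $\sum x_i$ and $x_{(1)}$ is $n^{-1/2}$, about 0.45 for $n = 5$, so critical values for $\sum x_i$ would have to change with $x_{(1)}$, whereas those for $z$ need not.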


23.21 Examples 23.7 and 23.8 afford a sophisticated justification for two of the 
standard normal distribution test procedures for means. Exercises 23.13 and 23.14 
at the end of this chapter, by following through the same argument, similarly justify 
two other standard procedures for variances, arriving in each case at a pair of UMP 
similar one-sided tests. Unfortunately, not all the problems of normal test theory are 
so tractable: the thorniest of them, the problem of two means which we discussed at 
length in Chapter 21, does not yield to the present approach, as the next example 
shows. 


Example 23.10 


For two normal distributions with means and variances $(\theta, \sigma_1^2)$, $(\theta+\mu, \sigma_2^2)$, to test
$H_0: \mu = 0$ on the basis of independent samples of $n_1$ and $n_2$ observations.

Given $H_0$, the sample means and variances $(\bar{x}_1, \bar{x}_2, s_1^2, s_2^2) = t$ form a set of four
jointly sufficient statistics for the three parameters $\theta$, $\sigma_1^2$, $\sigma_2^2$ left unspecified by $H_0$.
They may be seen to be minimal sufficient by use of (23.31); cf. Lehmann and Scheffé
(1950). But $t$ is not boundedly complete, since $\bar{x}_1$, $\bar{x}_2$ are normally distributed
independently of $s_1^2$, $s_2^2$ and of each other, so that any bounded odd function of $(\bar{x}_1-\bar{x}_2)$
alone will have zero expectation. We therefore cannot rely on (23.13) to find all similar
regions, though regions satisfying (23.13) would certainly be similar, by 23.8. But it
is easy to see, from the fact that the Likelihood Function contains the four components
of $t$ and no other functions of the observations, that any region consisting entirely of a
fraction $\alpha$ of each surface of constant $t$ will have the same probability content in the
sample space whatever the value of $\mu$, and will therefore be an ineffective critical region
with power exactly equal to its size. This disconcerting aspect of a familiar and useful
property of normal distributions was pointed out by Watson (1957a).

No useful exact unrandomized similar regions exist for this problem; see Linnik
(1964). If we are prepared to use asymptotically similar regions, we may use Welch's
method expounded in 21.25 as an interval estimation technique; similarly, if we are
prepared to introduce an element of randomization, Scheffé's method of 21.15-22 is
available. The relation between the terminology of confidence intervals and that
of the theory of tests is discussed in 23.26 below.


23.22 The discussion of 23.20 and Examples 23.8-10 make it clear that, if there
is a complete sufficient statistic for the unspecified parameter, the problem of selecting
a most powerful test for a composite hypothesis is considerably reduced if we restrict
our choice to similar regions. But something may be lost by this: for specific
alternatives there may be a non-similar test, satisfying (23.6), with power greater than the most
powerful similar test.

Lehmann and Stein (1948) considered this problem for the composite hypotheses
considered in Example 23.7 and Exercise 23.13. In the former, where we are testing
the mean of a normal distribution, they found that if $\alpha \ge \tfrac{1}{2}$ there is no non-similar
test more powerful than "Student's" $t$, whatever the true values $\mu_1$, $\sigma_1$, but that for
$\alpha < \tfrac{1}{2}$ (as in practice it always is) there is a more powerful critical region, which is of
form

$$\sum_i \{x_i - c_\alpha(\mu_1, \sigma_1)\}^2 \le k_\alpha(\mu_1, \sigma_1). \qquad (23.45)$$

Similarly, for the variance of a normal distribution (Exercise 23.13 below), they found
that if $\sigma_1 > \sigma_0$ no more powerful non-similar test exists, but if $\sigma_1 < \sigma_0$ the region

$$\sum_i (x_i - \mu_1)^2 \le k_\alpha \qquad (23.46)$$

is more powerful than the best similar critical region.

Thus if we restrict the alternative class $H_1$ sufficiently, we can sometimes improve
the power of the test, while reducing the average value of the Type I error below the
size $\alpha$, by abandoning the requirement of similarity. In practice, this is not a very
strong argument against using similar regions, precisely because we are not usually
in a position to be very restrictive about the alternatives to a composite hypothesis.


Bias in tests 


23.23 In the last chapter (22.26-8) we briefly discussed the problem of testing a 
simple H, against a two-sided class of alternatives, where no UMP test generally exists. 
We now return to this subject from another viewpoint, although the two-sided nature 
of the alternative hypothesis is not essential to our discussion, as we shall see. 
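Example 23.11


Consider again the problem of Examples 22.2-3 and of 22.27, that of testing the
mean $\mu$ of a normal population with known variance, taken as unity for convenience.
Suppose that we restrict ourselves to tests based on the distribution of the sample mean
$\bar{x}$, as we may do by 23.3 since $\bar{x}$ is sufficient. Generalizing (22.55), consider the size-$\alpha$
region defined by

$$\bar{x} \le a_{\alpha_1}, \qquad \bar{x} \ge b_{\alpha_2}, \qquad (23.47)$$
where $\alpha_1 + \alpha_2 = \alpha$, and $\alpha_1$ is not now necessarily equal to $\alpha_2$. $a$ and $b$ are defined, as
at (22.15), by

$$a_{\alpha_1} = \mu_0 - d_{\alpha_1}/n^{1/2}, \qquad b_{\alpha_2} = \mu_0 + d_{\alpha_2}/n^{1/2},$$
and
$$G(-d_\alpha) = \int_{-\infty}^{-d_\alpha} (2\pi)^{-1/2}\exp(-\tfrac{1}{2}y^2)\,dy = \alpha.$$
We take $d_\alpha > 0$ without loss of generality.
Exactly as at (22.56), the power of the critical region (23.47) is seen to be
$$P = G\{n^{1/2}\Delta - d_{\alpha_2}\} + G\{-n^{1/2}\Delta - d_{\alpha_1}\}, \qquad (23.48)$$

where $\Delta = \mu_1 - \mu_0$.
We consider the power (23.48) as a function of $\Delta$. Its first two derivatives are

$$P' = \Big(\frac{n}{2\pi}\Big)^{1/2}\big[\exp\{-\tfrac{1}{2}(n^{1/2}\Delta - d_{\alpha_2})^2\} - \exp\{-\tfrac{1}{2}(n^{1/2}\Delta + d_{\alpha_1})^2\}\big] \qquad (23.49)$$

and

$$P'' = \frac{n}{(2\pi)^{1/2}}\big[(d_{\alpha_2} - n^{1/2}\Delta)\exp\{-\tfrac{1}{2}(n^{1/2}\Delta - d_{\alpha_2})^2\}
+ (n^{1/2}\Delta + d_{\alpha_1})\exp\{-\tfrac{1}{2}(n^{1/2}\Delta + d_{\alpha_1})^2\}\big]. \qquad (23.50)$$
From (23.49), we can only have $P' = 0$ if

$$\Delta = (d_{\alpha_2} - d_{\alpha_1})/(2n^{1/2}). \qquad (23.51)$$
When (23.51) holds, we have from (23.50)
$$P'' = \frac{n}{(2\pi)^{1/2}}(d_{\alpha_1} + d_{\alpha_2})\exp\{-\tfrac{1}{2}(n^{1/2}\Delta + d_{\alpha_1})^2\}. \qquad (23.52)$$

Since we have taken $d_\alpha$ always positive, we therefore have $P'' > 0$ at the stationary
value, which is therefore a minimum. From (23.51), it occurs at $\Delta = 0$ only when
$\alpha_1 = \alpha_2$, the case discussed in 22.27. Otherwise, the unique minimum occurs at some
value $\mu_m$, where

$$\mu_m > \mu_0 \ \text{if } \alpha_1 > \alpha_2, \qquad \mu_m < \mu_0 \ \text{if } \alpha_1 < \alpha_2.$$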
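The power function (23.48) and the location (23.51) of its minimum are easily evaluated. The following sketch uses `statistics.NormalDist` from the Python standard library for $G$ and its inverse; the values $n = 25$, $\alpha_1 = 0.04$, $\alpha_2 = 0.01$ and the function names are illustrative assumptions of ours.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()   # standard normal distribution, whose d.f. is G

def d(alpha):
    """The positive deviate d_alpha defined by G(-d_alpha) = alpha (alpha < 1/2)."""
    return -_STD_NORMAL.inv_cdf(alpha)

def power(delta, n, alpha1, alpha2):
    """Power (23.48) of the region (23.47) at delta = mu_1 - mu_0."""
    root_n = n ** 0.5
    return (_STD_NORMAL.cdf(root_n * delta - d(alpha2))
            + _STD_NORMAL.cdf(-root_n * delta - d(alpha1)))

def minimising_delta(n, alpha1, alpha2):
    """Value (23.51) of delta at which the power is a minimum."""
    return (d(alpha2) - d(alpha1)) / (2.0 * n ** 0.5)

n, alpha1, alpha2 = 25, 0.04, 0.01      # unequal tails, total size 0.05
delta_m = minimising_delta(n, alpha1, alpha2)

print(power(0.0, n, alpha1, alpha2))      # the size, alpha1 + alpha2
print(delta_m)                            # positive, since alpha1 > alpha2
print(power(delta_m, n, alpha1, alpha2))  # smaller than the size
```

With these values the minimum lies to the right of $\mu_0$ and the power there falls below the size, so that for alternatives near $\mu_m$ the test rejects less often than it does when $H_0$ is true.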


23.24 The implication of Example 23.11 is that, except when $\alpha_1 = \alpha_2$, there exist
values of $\mu$ in the alternative class $H_1$ for which the probability of rejecting $H_0$ is actually
smaller when $H_0$ is false than when it is true. (Note that if we were considering a
one-sided class of alternatives (say, $\mu_1 > \mu_0$), the same situation would arise if we used
the critical region located in the wrong tail of the distribution of $\bar{x}$ (say, $\bar{x} \leq a_\alpha$).) It
is clearly undesirable to use a test which is more likely to reject the hypothesis when
it is true than when it is false. In fact, we can improve on such a test by using a table
of random numbers to reject the hypothesis with probability $\alpha$: the power of this
procedure will always be $\alpha$.

We may now generalize our discussion. If a size-$\alpha$ critical region $w$ for $H_0: \theta = \theta_0$
against the simple $H_1: \theta = \theta_1$ is such that its power

$$P\{x \in w \mid \theta_1\} \geq \alpha, \tag{23.53}$$

it is said to give an unbiassed(*) test of $H_0$ against $H_1$; in the contrary case, the region
$w$, and the test it provides, are said to be biassed.(*) If $H_1$ is composite, and (23.53)
holds for every member of $H_1$, $w$ is said to be an unbiassed critical region against $H_1$.
It should be noted that unbiassedness does not require that the power function should
actually have a regular minimum at $\theta_0$, as we found to be the case in Example 23.11
when $\alpha_1 = \alpha_2$, although this is often found to be so in practice. Fig. 22.3 on page 182
illustrates the appearance of the power function for an unbiassed test (the full line)
and two biassed tests.

If no unbiassed test exists, there may be a "locally unbiassed Type M" test (Krishnan,
1966) which has average power $\geq \alpha$ in a neighbourhood of $H_0$.

The criterion of unbiassedness for tests has such strong intuitive appeal that it is
natural to restrict oneself to the class of unbiassed tests when investigating a problem,
and to seek UMP unbiassed (UMPU) tests, which may exist even against two-sided
alternative hypotheses, for which we cannot generally hope to find UMP tests without
some restriction on the class of tests considered. Thus, in Example 23.11, the
"equal-tails" test based on $\bar{x}$ is at once seen to be UMPU in the class of tests there considered.
That it is actually UMPU among all tests of $H_0$ will be seen in 23.33.


Example 23.12 
We have left over to Exercise 23.13 the result that, for a normal distribution with
mean $\mu$ and variance $\sigma^2$, the statistic $z = \sum_{i=1}^{n} (x_i - \bar{x})^2$ gives a pair of one-sided UMP
similar tests of the hypothesis $H_0: \sigma^2 = \sigma_0^2$, the BCR being

$$z \geq a_\alpha \ \text{ if } \sigma_1 > \sigma_0, \qquad z \leq b_\alpha \ \text{ if } \sigma_1 < \sigma_0.$$

Now consider the two-sided alternative hypothesis

$$H_1: \sigma^2 \neq \sigma_0^2.$$

By 22.18 there is no UMP test of $H_0$ against $H_1$, but we are intuitively tempted to use
the statistic $z$, splitting the critical region equally between its tails in the hope of
achieving unbiassedness, as in Example 23.11. Thus we reject $H_0$ if

$$z \geq a_{\frac{1}{2}\alpha} \quad \text{or} \quad z \leq b_{\frac{1}{2}\alpha}.$$
This critical region is certainly similar, for the distribution of $z$ is not dependent on $\mu$,
the nuisance parameter. Since $z/\sigma^2$ has a chi-square distribution with $(n-1)$ d.f.,
whether $H_0$ or $H_1$ holds, we have

$$a_{\frac{1}{2}\alpha} = \sigma_0^2\,\chi^2_{1-\frac{1}{2}\alpha}, \qquad b_{\frac{1}{2}\alpha} = \sigma_0^2\,\chi^2_{\frac{1}{2}\alpha},$$

where $\chi^2_\alpha$ is the $100\alpha$ per cent point of that chi-square distribution. When $H_1$ holds,
it is $z/\sigma_1^2$ which has the distribution, and $H_0$ will then be rejected when

$$\frac{z}{\sigma_1^2} \geq \frac{\sigma_0^2}{\sigma_1^2}\,\chi^2_{1-\frac{1}{2}\alpha} \quad \text{or} \quad \frac{z}{\sigma_1^2} \leq \frac{\sigma_0^2}{\sigma_1^2}\,\chi^2_{\frac{1}{2}\alpha}.$$

(*) This use of "bias" is unconnected with that of the theory of estimation, and is only
prevented from being confusing by the fortunate fact that the context rarely makes confusion
possible.


The power of the test against any alternative value $\sigma_1^2$ is the sum of the probabilities
of these two events. We thus require the probability that a chi-square variable will
fall outside its $100(\tfrac{1}{2}\alpha)$ per cent and $100(1-\tfrac{1}{2}\alpha)$ per cent points each multiplied by a
constant $\sigma_0^2/\sigma_1^2$. For each value of $\alpha$ and $(n-1)$, the degrees of freedom, this probability
can be calculated from a table of the distribution for each value of $\sigma_0^2/\sigma_1^2$. Fig. 23.1
shows the power function resulting from such calculations by Neyman and Pearson
(1936b) for the case $n = 3$, $\alpha = 0.02$. The power is less than $\alpha$ in this case when
$0.5 < \sigma_1^2/\sigma_0^2 < 1$, and the test is therefore biassed.

Fig. 23.1: Power function of a test for a normal distribution variance (see text)
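
The calculation for $n = 3$ is simple enough to reproduce directly, because the chi-square distribution function with 2 degrees of freedom has the closed form $1 - e^{-x/2}$. The sketch below is an illustration only, not the original computation of Neyman and Pearson; the function names are assumptions introduced here.

```python
from math import exp, log

def chi2_cdf_2df(x):
    """Distribution function of chi-square with 2 d.f.: 1 - exp(-x/2)."""
    return 1.0 - exp(-x / 2.0) if x > 0 else 0.0

def chi2_point_2df(p):
    """100p per cent point of chi-square with 2 d.f."""
    return -2.0 * log(1.0 - p)

def equal_tails_power(ratio, alpha):
    """Power of the equal-tails test at ratio = sigma_1^2 / sigma_0^2, for n = 3."""
    lower = chi2_point_2df(alpha / 2.0)
    upper = chi2_point_2df(1.0 - alpha / 2.0)
    scale = 1.0 / ratio                       # the constant sigma_0^2 / sigma_1^2
    return chi2_cdf_2df(scale * lower) + 1.0 - chi2_cdf_2df(scale * upper)

alpha = 0.02
for ratio in (0.4, 0.5, 0.6, 0.8, 1.0, 1.5):
    print(ratio, round(equal_tails_power(ratio, alpha), 5))
```

The printed values fall below $\alpha = 0.02$ for ratios between 0.5 and 1 and lie at or above it elsewhere, in agreement with the statement in the text.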

We now enquire whether, by modifying the apportionment of the critical region
between the tails of the distribution of $z$, we can remove the bias. Suppose that the
critical region is

$$z \geq a_{1-\alpha_2} \quad \text{or} \quad z \leq b_{\alpha_1},$$

where $\alpha_1 + \alpha_2 = \alpha$. As before, the power of the test is the probability that a
chi-square variable with $(n-1)$ degrees of freedom, say $\chi^2_{n-1}$, falls outside the range of its
$100\alpha_1$ per cent and $100(1-\alpha_2)$ per cent points, each multiplied by the constant
$\theta = \sigma_0^2/\sigma_1^2$. Writing $F$ for the distribution function of $\chi^2_{n-1}$, we have

$$P = F(\theta\chi^2_{\alpha_1}) + 1 - F(\theta\chi^2_{1-\alpha_2}). \tag{23.54}$$
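Regarded as a function of $\theta$, this is the power function. We now choose $\alpha_1$ and $\alpha_2$ so
that this power function has a regular minimum at $\theta = 1$, where it equals the size of
the test. Differentiating (23.54), we have

$$P' = \chi^2_{\alpha_1} f(\theta\chi^2_{\alpha_1}) - \chi^2_{1-\alpha_2} f(\theta\chi^2_{1-\alpha_2}), \tag{23.55}$$

where $f$ is the frequency function of $\chi^2_{n-1}$. If this is to be zero when $\theta = 1$, we require

$$\chi^2_{\alpha_1} f(\chi^2_{\alpha_1}) = \chi^2_{1-\alpha_2} f(\chi^2_{1-\alpha_2}). \tag{23.56}$$

Substituting for the frequency function

$$f(y)\,dy \propto e^{-\frac{1}{2}y}\, y^{\frac{1}{2}(n-3)}\,dy, \tag{23.57}$$

we have finally from (23.56) the condition for unbiassedness

$$\left\{\frac{\chi^2_{\alpha_1}}{\chi^2_{1-\alpha_2}}\right\}^{\frac{1}{2}(n-1)} = \exp\{\tfrac{1}{2}(\chi^2_{\alpha_1} - \chi^2_{1-\alpha_2})\}. \tag{23.58}$$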


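Values of $\alpha_1$ and $\alpha_2$ satisfying (23.58) will give a test whose power function has zero
derivative at $\theta = 1$. To investigate whether it is strictly unbiassed, we write
(23.55), using (23.57) and (23.58), as

$$P' = c\,\theta^{\frac{1}{2}(n-3)}\,(\chi^2_{\alpha_1})^{\frac{1}{2}(n-1)} \exp(-\tfrac{1}{2}\chi^2_{\alpha_1}) \left[\exp\{\tfrac{1}{2}\chi^2_{\alpha_1}(1-\theta)\} - \exp\{\tfrac{1}{2}\chi^2_{1-\alpha_2}(1-\theta)\}\right], \tag{23.59}$$

where $c$ is a positive constant. Since $\chi^2_{1-\alpha_2} > \chi^2_{\alpha_1}$, we have from (23.59)

$$P' < 0 \ \text{ if } \theta < 1, \qquad P' = 0 \ \text{ if } \theta = 1, \qquad P' > 0 \ \text{ if } \theta > 1. \tag{23.60}$$

(23.60) shows that the test with $\alpha_1, \alpha_2$ determined by (23.58) is unbiassed in the strict
sense, for the power function is monotonic decreasing as $\theta$ increases from 0 to 1 and
monotonic increasing as $\theta$ increases from 1 to $\infty$.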

Tables of the values of $\chi^2_{\alpha_1}$ and $\chi^2_{1-\alpha_2}$ satisfying (23.58) are given by Ramachandran
(1958) for $\alpha = 0.05$ and $n-1 = 2\,(1)\,8\,(2)\,24$, 30, 40 and 60; other tables are
described in 20.21(3), where the terminology of confidence intervals is used (cf. 23.26 for
the correspondence with tests). Table 23.1 compares some of Ramachandran's values
with the corresponding limits for the biassed "equal-tails" test which we have
considered, obtained from the Biometrika Tables.


Table 23.1: Limits outside which the chi-square variable $\sum (x-\bar{x})^2/\sigma_0^2$ must fall for
$H_0: \sigma^2 = \sigma_0^2$ to be rejected ($\alpha = 0.05$)

Degrees of        Unbiassed test      "Equal-tails" test    Differences
freedom (n-1)     limits              limits
      2           ( 0.08,  9.53)      ( 0.05,  7.38)        (0.03, 2.15)
      5           ( 0.99, 14.37)      ( 0.83, 12.83)        (0.16, 1.54)
     10           ( 3.52, 21.73)      ( 3.25, 20.48)        (0.27, 1.25)
     20           ( 9.96, 35.23)      ( 9.59, 34.17)        (0.37, 1.06)
     30           (17.21, 47.96)      (16.79, 46.98)        (0.42, 0.98)
     40           (24.86, 60.32)      (24.43, 59.34)        (0.43, 0.98)
     60           (40.93, 84.23)      (40.48, 83.30)        (0.45, 0.93)

It will be seen that the differences in both limits are proportionately large for small $n$,
that the lower limit difference increases steadily with $n$, and the larger limit difference
decreases steadily with $n$. At $n-1 = 60$, both differences are just over 1 per cent of
the values of the limits.
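
The first row of Table 23.1 can be reproduced without tables. For 2 degrees of freedom the distribution function is $1 - e^{-x/2}$, so that, writing $p_1$ for the probability in the lower tail, the limits are $-2\log(1-p_1)$ and $-2\log(\alpha - p_1)$, and (23.58) reduces to equality of (limit) $\times$ (tail factor) at the two ends. The sketch below, which is an illustration and not Ramachandran's method, solves this by bisection; the function name and tolerance are assumptions.

```python
from math import log

def unbiassed_limits_2df(alpha, tol=1e-12):
    """Chi-square limits (2 d.f.) satisfying the size condition and (23.58).

    p1 is the probability in the lower tail; alpha - p1 lies in the upper tail.
    With 2 d.f., (23.58) becomes lower * (1 - p1) = upper * (alpha - p1).
    """
    def gap(p1):
        lower = -2.0 * log(1.0 - p1)
        upper = -2.0 * log(alpha - p1)
        return lower * (1.0 - p1) - upper * (alpha - p1)

    lo, hi = 1e-15, alpha - 1e-15          # gap(lo) < 0 < gap(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    p1 = 0.5 * (lo + hi)
    return -2.0 * log(1.0 - p1), -2.0 * log(alpha - p1)

lower, upper = unbiassed_limits_2df(0.05)
print(round(lower, 2), round(upper, 2))                          # unbiassed limits
print(round(-2.0 * log(0.975), 2), round(-2.0 * log(0.025), 2))  # equal-tails limits
```

The two printed pairs may be compared with the entries for $n - 1 = 2$ in Table 23.1.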

We defer the question whether the unbiassed test is UMPU to Example 23.14 
below. 
Unbiassed tests and similar tests

23.25 There is a close connection between unbiassedness and similarity, which
often leads to the best unbiassed test emerging directly from an analysis of the similar
regions for a problem.

We consider a more general form of hypothesis than (23.2), namely

$$H_0': \theta_l \leq \theta_{l0}, \tag{23.61}$$

which is to be tested against

$$H_1: \theta_l > \theta_{l0}. \tag{23.62}$$

If we can find a critical region $w$ satisfying (23.6) for all $\theta_l$ in $H_0'$ as well as for all values
of the unspecified parameters $\theta_s$, i.e.

$$P(H_0', \theta_s) \leq \alpha, \tag{23.63}$$

(where $P$ is the power function whose value is the probability of rejecting $H_0$), the test
based on $w$ will be of size $\alpha$ as before. If it is also unbiassed, we have from (23.53)

$$P(H_1, \theta_s) \geq \alpha. \tag{23.64}$$

Now if the power function $P$ is a continuous function of $\theta_l$, (23.63) and (23.64)
imply, in view of the form of $H_0'$ and $H_1$,

$$P(\theta_{l0}, \theta_s) = \alpha, \tag{23.65}$$

i.e. that $w$ is a similar critical region for the "boundary" hypothesis

$$H_0: \theta_l = \theta_{l0}.$$

All unbiassed tests of $H_0'$ are similar tests of $H_0$. If we confine our discussions to
similar tests of $H_0$, using the methods we have already discussed, and find a test with
optimum properties (e.g., a UMP similar test), then provided that this test is unbiassed it
will retain the optimum properties in the class of unbiassed tests of $H_0'$ (e.g. it will
be a UMPU test).

Exactly the same argument holds if $H_0'$ specifies that the parameter point $\theta_l$ lies
within a certain region $R$ (which may consist of a number of subregions) in the
parameter space, and $H_1$ that the $\theta_l$ lies in the remainder of that space: if the power function
is continuous in $\theta_l$, then if a critical region $w$ is unbiassed for testing $H_0'$, it is a similar
region for testing the hypothesis $H_0$ that $\theta_l$ lies on the boundary of $R$. If $w$ gives
an unbiassed test of $H_0'$, it will carry over into the class of unbiassed tests of $H_0'$ any
optimum properties it may have as a similar test of $H_0$. There will not always be
a UMP similar test of $H_0$ if the alternatives are two-sided: a UMPU test may exist
against such alternatives, but it must be found by other methods.


Example 23.13 
We return to the hypothesis of Example 23.12. One-sided critical regions based
on the statistic $z \geq a_\alpha$, $z \leq b_\alpha$, give UMP similar tests against one-sided alternatives.
Each of them is easily seen to be unbiassed in testing one of

$$H_0': \sigma^2 \leq \sigma_0^2, \qquad H_0'': \sigma^2 \geq \sigma_0^2$$

respectively against

$$H_1': \sigma^2 > \sigma_0^2, \qquad H_1'': \sigma^2 < \sigma_0^2.$$

Thus they are, by the argument of 23.25, UMPU tests for these one-sided situations.
For the two-sided alternative $H_1: \sigma^2 \neq \sigma_0^2$, the unbiassed test based on (23.58)
cannot be shown to be UMPU by this method, since we have not shown it to be UMP
similar.


Tests and confidence intervals 


23.26 The early work on unbiassed tests was largely carried out by Neyman and
Pearson in the series of papers mentioned in 22.1, and by Neyman (1935, 1938b),
Scheffé (1942a) and Lehmann (1947). Much of the detail of their work has now been
superseded, as pioneering work usually is, but their terminology is still commonly used,
and it is desirable to provide a "dictionary" connecting it with the present terminology,
where it differs. We take the opportunity of translating the ideas of the theory of
hypothesis-testing into those of the theory of confidence intervals, as promised in
20.20.

If a sample is observed, we may ask the question: for which values of $\theta$ does the
sample point $x$ form part of the acceptance region $A$ complementary to the size-$\alpha$
critical region for a certain test on the parameter $\theta$? If we aggregate these
"acceptable" values of $\theta$, we obtain the level-$(1-\alpha)$ confidence interval $C$ for $\theta$ corresponding
to that test, for $\theta$ is in $C$ if and only if $x$ is in $A$, i.e. with probability $1-\alpha$. We used
this method of constructing confidence intervals in 20.3, and indeed throughout
Chapter 20. There is thus no need to derive optimum properties separately for tests
and for intervals: there is a one-to-one correspondence between the problems.
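
The aggregation of "acceptable" values can be illustrated computationally. The sketch below assumes, for illustration only, the equal-tails test of a normal mean with unit variance (as in Example 23.11) and a sample summary $\bar{x} = 1.3$, $n = 16$; the grid of candidate values and the function names are likewise assumptions. It collects every candidate value that the test accepts and compares the result with the familiar closed-form interval.

```python
from math import sqrt
from statistics import NormalDist

def accepts(theta0, x_bar, n, alpha):
    """True if the equal-tails size-alpha test of mean = theta0 accepts (unit variance)."""
    deviate = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return abs(x_bar - theta0) * sqrt(n) <= deviate

def confidence_set(x_bar, n, alpha, grid):
    """Aggregate the 'acceptable' values of theta on a grid of candidate values."""
    return [theta for theta in grid if accepts(theta, x_bar, n, alpha)]

x_bar, n, alpha = 1.3, 16, 0.05                  # assumed illustrative sample summary
grid = [i / 1000.0 for i in range(0, 3001)]      # candidate values 0.000, ..., 3.000
accepted = confidence_set(x_bar, n, alpha, grid)
half_width = NormalDist().inv_cdf(1.0 - alpha / 2.0) / sqrt(n)
print(min(accepted), max(accepted))              # grid approximation to the interval
print(x_bar - half_width, x_bar + half_width)    # closed-form interval for comparison
```

The end-points found on the grid agree with the closed-form limits to within the grid spacing.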


Property of test                                                         Property of corresponding
Present terminology    Older terminology                                 confidence interval

UMP                                                                      "Shortest" (= most selective)
Unbiassed                                                                Unbiassed
UMPU "locally"         Type A(*) (simple $H_0$, one parameter)
(i.e. near $H_0$)      Type B(*) (composite $H_0$)                       "Short" unbiassed
                       Type C(*) (simple $H_0$, two or more parameters)
UMPU                   Type A$_1$(*) (simple $H_0$, one parameter)       "Shortest" unbiassed
                       Type B$_1$(*) (composite $H_0$)
Unbiassed similar      Bisimilar

(*) Subject to regularity conditions.


For example, in 20.31, we noticed that in setting confidence intervals for the mean $\mu$
of a normal distribution with unspecified variance, using "Student's" $t$ distribution, the
length of the interval was a random variable, being a multiple of the sample standard
deviation. In Example 23.7, on the other hand, we remarked that the power of the
similar test based on "Student's" $t$ was a function of the unknown variance. Now
the power of the test is the probability of rejecting the hypothesis when false, i.e. in
confidence interval terms, is the probability of not covering another value of $\mu$ than
the true one, $\mu_0$. If this probability is a function of the unknown variance, for all
values of $\mu$, we evidently cannot pre-assign the length of the interval as well as the
confidence coefficient. Our earlier statement was a consequence of the later one.
UMPU tests for the exponential family

23.27 We now give an account of some remarkably comprehensive results, due
to Lehmann and Scheffé (1955), which establish the existence of, and give a
construction for, UMPU tests for a variety of parametric hypotheses in distributions belonging
to the exponential family (17.86). We write the joint distribution of $n$ independent
observations from such a distribution as

$$f(x \mid \tau) = D(\tau)\,h(x)\exp\Big\{\sum_{j=1}^{r+1} b_j(\tau)\,u_j(x)\Big\}, \tag{23.66}$$

where $x$ is the column vector $(x_1, \ldots, x_n)$ and $\tau$ is a vector of $(r+1)$ parameters
$(\tau_1, \ldots, \tau_{r+1})$. In matrix notation, the exponent in (23.66) may be concisely written
$u'b$, where $u$ and $b$ are column vectors.
Suppose now that we are interested in the particular linear function of the
parameters

$$\theta = \sum_{j=1}^{r+1} a_{j1}\,b_j(\tau), \tag{23.67}$$

where $\sum_{j=1}^{r+1} a_{j1}^2 = 1$. Write $A$ for an orthogonal matrix $(a_{uv})$ whose first column
contains the coefficients in (23.67), and transform to a new vector of $(r+1)$ parameters
$(\theta, \psi)$, where $\psi$ is the column vector $(\psi_1, \ldots, \psi_r)$, by the equation

$$\begin{pmatrix} \theta \\ \psi \end{pmatrix} = A'b. \tag{23.68}$$

The first row of (23.68) is (23.67). We now suppose that there is a column vector
of statistics $T = (s, t_1, \ldots, t_r)$ defined by the relation

$$T'\begin{pmatrix} \theta \\ \psi \end{pmatrix} = u'b, \tag{23.69}$$

i.e. we suppose that the exponent in (23.66) may be expressed as $\theta s(x) + \sum_{j=1}^{r} \psi_j t_j(x)$.

Using (23.68), (23.69) becomes

$$T'\begin{pmatrix} \theta \\ \psi \end{pmatrix} = u'A\begin{pmatrix} \theta \\ \psi \end{pmatrix}. \tag{23.70}$$

(23.70) is an identity in $(\theta, \psi)$, so we have $T' = u'A$ or

$$T = A'u. \tag{23.71}$$

Comparing (23.71) with (23.68), we see that each component of $T$ is the same function
of the $u_j(x)$ as the corresponding component of $(\theta, \psi)$ is of the $b_j(\tau)$. In particular,
the first component is, from (23.67),

$$s(x) = \sum_{j=1}^{r+1} a_{j1}\,u_j(x), \tag{23.72}$$

while the $t_j(x)$, $j = 1, 2, \ldots, r$, are orthogonal to $s(x)$.

Note that the orthogonality condition $\sum_{j=1}^{r+1} a_{j1}^2 = 1$ does not hamper us in testing
hypotheses about $\theta$ defined by (23.67), since only a constant factor need be changed
and the hypothesis adjusted accordingly.
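
The effect of the orthogonal transformation can be seen in the smallest non-trivial case, $r + 1 = 2$, where $A$ is a plane rotation. The sketch below is an illustration under assumed values: the data, the parameters $\mu = 1$, $\sigma^2 = 2$ and the coefficients $(3/5, 4/5)$ are arbitrary, and the function name is introduced here. It uses the normal-distribution parametrization $b = (1/\sigma^2, \mu/\sigma^2)$, $u = (-\tfrac{1}{2}\sum x^2, \sum x)$ and checks that the exponent $u'b$ is unchanged when written as $\theta s + \psi t$.

```python
from math import sqrt, isclose

def reparametrize(a, b, u):
    """Rotate natural parameters b and statistics u (two components each).

    `a` holds the coefficients of theta in (23.67), normalised so that
    a[0]**2 + a[1]**2 = 1. A = [[a0, -a1], [a1, a0]] is then orthogonal
    with first column a, and (theta, psi) = A'b, (s, t) = A'u.
    """
    a0, a1 = a
    theta = a0 * b[0] + a1 * b[1]
    psi = -a1 * b[0] + a0 * b[1]
    s = a0 * u[0] + a1 * u[1]
    t = -a1 * u[0] + a0 * u[1]
    return (theta, psi), (s, t)

# Assumed illustrative values: a normal sample with mean mu and variance sigma2.
x = [0.3, 1.9, 2.4, 0.8, 1.1]
mu, sigma2 = 1.0, 2.0
b = (1.0 / sigma2, mu / sigma2)                   # natural parameters b_j
u = (-0.5 * sum(v * v for v in x), sum(x))        # matching statistics u_j
norm = sqrt(3.0 ** 2 + 4.0 ** 2)
a = (3.0 / norm, 4.0 / norm)                      # arbitrary unit-length coefficients

(theta, psi), (s, t) = reparametrize(a, b, u)
exponent_old = u[0] * b[0] + u[1] * b[1]          # u'b, as in (23.66)
exponent_new = theta * s + psi * t                # theta*s + psi*t, as in (23.69)
print(isclose(exponent_old, exponent_new))
```

Any other unit-length choice of `a` leaves the exponent equally unchanged, which is all that (23.69) to (23.71) assert.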
23.28 If, therefore, we can reduce a hypothesis-testing problem (usually through
its sufficient statistics) to the standard form of one concerning $\theta$ in

$$f(x \mid \theta, \psi) = C(\theta, \psi)\,h(x)\exp\Big\{\theta s(x) + \sum_{i=1}^{r} \psi_i t_i(x)\Big\}, \tag{23.73}$$

by the device of the previous section, we can avail ourselves of the results summarized
in 23.10: given a hypothesis value for $\theta$, the $r$-component vector $t = (t_1, \ldots, t_r)$ will
be a complete sufficient statistic for the $r$-component parameter $\psi = (\psi_1, \ldots, \psi_r)$, and
we now consider the problem of using $s$ and $t$ to test various composite hypotheses
concerning $\theta$, $\psi$ being an unspecified ("nuisance") parameter. Simple hypotheses
are the special case when $r = 0$, with no nuisance parameter.


23.29 For this purpose we shall need an extended form of the Neyman-Pearson
lemma of 22.10. Let $f(x \mid \boldsymbol{\theta})$ be a frequency function, and $\boldsymbol{\theta}_i$ a subset of admissible
values of the vector of parameters $\boldsymbol{\theta}$ $(i = 1, 2, \ldots, k)$. A specific element of $\boldsymbol{\theta}_i$ is
written $\boldsymbol{\theta}_i^0$. $\boldsymbol{\theta}^*$ is a particular value of $\boldsymbol{\theta}$. The vector $u_i(x)$ is sufficient for $\boldsymbol{\theta}$ when
$\boldsymbol{\theta}$ is in $\boldsymbol{\theta}_i$ and its distribution is $g_i(u_i \mid \boldsymbol{\theta}_i)$. Since the Likelihood Function factorizes in
the presence of sufficiency, the conditional value of $f(x \mid \boldsymbol{\theta}_i)$, given $u_i$, will be
independent of $\boldsymbol{\theta}_i$, and we write it $f(x \mid u_i)$. Finally, we define $l_i(x)$, $m_i(u_i)$ to be
non-negative functions, of the observations and of $u_i$ respectively.

Now suppose we have a critical region $w$ for which

$$\int_w \{l_i(x) f(x \mid u_i)\}\,dx = \alpha_i. \tag{23.74}$$

Since the product in braces in (23.74) is non-negative, it may be regarded as a frequency
function, and we may say that the conditional size of $w$, given $u_i$, is $\alpha_i$ with respect
to this distribution. We now write

$$\beta_i = \alpha_i \int m_i(u_i)\,g_i(u_i \mid \boldsymbol{\theta}_i^0)\,du_i
= \int_w l_i(x)\,m_i(u_i)\{f(x \mid u_i)\,g_i(u_i \mid \boldsymbol{\theta}_i^0)\}\,dx
= \int_w \{l_i(x)\,m_i(u_i)\,f(x \mid \boldsymbol{\theta}_i^0)\}\,dx. \tag{23.75}$$

The product in braces is again essentially a frequency function, say $p(x \mid \boldsymbol{\theta}_i^0)$. To test
the simple hypothesis that $p(x \mid \boldsymbol{\theta}_i^0)$ holds against the simple alternative that $f(x \mid \boldsymbol{\theta}^*)$
holds, we use (22.6) and find that the BCR $w$ of size $\beta_i$ consists of points satisfying

$$[f(x \mid \boldsymbol{\theta}^*)]/[p(x \mid \boldsymbol{\theta}_i^0)] \geq c_i(\beta_i), \tag{23.76}$$

where $c_i$ is a non-negative constant. (23.76) will hold for every value of $i$. Thus for
testing the composite hypothesis that any of $p(x \mid \boldsymbol{\theta}_i^0)$ holds $(i = 1, 2, \ldots, k)$, we require
all $k$ of the inequalities (23.76) to be satisfied by $w$. If we now write $k\,m_i(u_i)/c_i(\beta_i)$
for $m_i(u_i)$ in $p(x \mid \boldsymbol{\theta}_i^0)$, as we may since $m_i(u_i)$ is arbitrary, we have from (23.76), adding
the inequalities for $i = 1, 2, \ldots, k$, the necessary and sufficient condition for a BCR

$$f(x \mid \boldsymbol{\theta}^*) \geq \sum_{i=1}^{k} l_i(x)\,m_i(u_i)\,f(x \mid \boldsymbol{\theta}_i^0). \tag{23.77}$$

This is the required generalization. (22.6) is its special case with $k = 1$, $l_1(x) = k_\alpha$
(constant), $m_1(u_1) \equiv 1$. (23.77) will play a role for composite hypotheses similar to
that of (22.6) for simple hypotheses.


One-sided alternatives 
23.30 Reverting to (23.73), we now investigate the problem of testing

$$H_0^{(1)}: \theta \leq \theta_0$$

against

$$H_1^{(1)}: \theta > \theta_0,$$

which we discussed in general terms in 23.25. Now that we are dealing with the
exponential form (23.73), we can show that there is always a UMPU test of $H_0^{(1)}$ against
$H_1^{(1)}$. By our discussion in 23.25, if a size-$\alpha$ critical region is unbiassed for $H_0^{(1)}$ against
$H_1^{(1)}$, it is a similar region for testing $\theta = \theta_0$.

Consider testing the simple

$$H_0: \theta = \theta_0, \quad \psi = \psi^0$$

against the simple

$$H_1: \theta = \theta^* > \theta_0, \quad \psi = \psi^*.$$

We now apply the result of 23.29. Putting $k = 1$, $l_1(x) \equiv 1$, $\alpha_1 = \alpha$, $\boldsymbol{\theta} = (\theta, \psi)$,
$\boldsymbol{\theta}_1 = (\theta_0, \psi)$, $\boldsymbol{\theta}^* = (\theta^*, \psi^*)$, $\boldsymbol{\theta}_1^0 = (\theta_0, \psi^0)$, $u_1 = t$, we have the result that the BCR
for testing $H_0$ against $H_1$ is defined from (23.77) and (23.73) as

$$\frac{C(\theta^*, \psi^*)\exp\{\theta^* s(x) + \sum_{i=1}^{r} \psi_i^* t_i(x)\}}{C(\theta_0, \psi^0)\exp\{\theta_0 s(x) + \sum_{i=1}^{r} \psi_i^0 t_i(x)\}} \geq m_1(t). \tag{23.78}$$

This may be rewritten

$$s(x)(\theta^* - \theta_0) \geq c_\alpha(t, \theta^*, \theta_0, \psi^*, \psi^0). \tag{23.79}$$

We now see that $c_\alpha$ is not a function of $\psi$, for since, by 23.28, $t$ is a sufficient statistic
for $\psi$ when $H_0$ holds, the value of $c_\alpha$ for given $t$ will be independent of $\psi^0$, $\psi^*$. Further,
from (23.79) we see that so long as the sign of $(\theta^* - \theta_0)$ does not change, the BCR will
consist of the largest $100\alpha$ per cent of the distribution of $s(x)$ given $\theta_0$. We thus have
a BCR for $\theta = \theta_0$ against $\theta > \theta_0$, giving a UMP test. This UMP test cannot have
smaller power than a randomized test against $\theta > \theta_0$ which ignores the observations.
The latter test has power equal to its size $\alpha$, so the UMP test is unbiassed against $\theta > \theta_0$,
i.e. by 23.25 it is UMPU. Its size for $\theta < \theta_0$ will not exceed its size at $\theta_0$, as is evident
from the consideration that the critical region (23.79) has minimum power against
$\theta < \theta_0$ and therefore its power (size) there is less than $\alpha$. Thus finally we have shown
that the largest $100\alpha$ per cent of the conditional distribution of $s(x)$, given $t$, gives a
UMPU size-$\alpha$ test of $H_0^{(1)}$ against $H_1^{(1)}$.


Two-sided alternatives 
23.31 We now consider the problem of testing

$$H_0^{(2)}: \theta = \theta_0$$

against

$$H_1^{(2)}: \theta \neq \theta_0.$$

Our earlier examples stopped short of establishing UMPU tests for two-sided
hypotheses of this kind (cf. Examples 23.12 and 23.13). Nevertheless a UMPU test does
exist for the linear exponential form (23.73).

From 23.25 we have that if the power function of a critical region is continuous
in $\theta$, and unbiassed, it is similar for $H_0^{(2)}$. Now for any region $w$, the power function is

$$P(w \mid \theta) = \int_w f(x \mid \theta, \psi)\,dx, \tag{23.80}$$

where $f$ is defined by (23.73). (23.80) is continuous and differentiable under the integral
sign with respect to $\theta$. For the test based on the critical region $w$ to be unbiassed we
must therefore have, for each value of $\psi$, the necessary condition

$$P'(w \mid \theta_0) = 0. \tag{23.81}$$

Differentiating (23.80) under the integral sign and using (23.73) and (23.81), we find
the condition for unbiassedness

$$\int_w \left[s(x) + \frac{C'(\theta_0)}{C(\theta_0)}\right] f(x \mid \theta_0, \psi)\,dx = 0$$

or

$$E\{s(x)\,c(w)\} = -\alpha\,C'(\theta_0)/C(\theta_0). \tag{23.82}$$

Since, from (23.73),

$$1/C(\theta) = \int h(x)\exp\Big\{\theta s(x) + \sum_{i=1}^{r} \psi_i t_i(x)\Big\}\,dx,$$

we have

$$E\{s(x)\} = -C'(\theta)/C(\theta), \tag{23.83}$$

and putting (23.83) into (23.82) gives

$$E\{s(x)\,c(w)\} = \alpha\,E\{s(x)\}. \tag{23.84}$$

Taking the expectation first conditionally upon the value of $t$, and then unconditionally,
(23.84) gives

$$E_t\left[E\{s(x)\,c(w) - \alpha s(x) \mid t\}\right] = 0. \tag{23.85}$$

Since $t$ is complete, (23.85) implies

$$E\{s(x)\,c(w) - \alpha s(x) \mid t\} = 0 \tag{23.86}$$

and since all similar regions for $H_0$ satisfy

$$E\{c(w) \mid t\} = \alpha, \tag{23.87}$$

(23.86) and (23.87) combine into

$$E\{s^{i-1}(x)\,c(w) \mid t\} = \alpha\,E\{s^{i-1}(x) \mid t\} = \alpha_i, \qquad i = 1, 2. \tag{23.88}$$

All our expectations are taken when $\theta_0$ holds.
Now consider a simple

$$H_0: \theta = \theta_0, \quad \psi = \psi^0$$

against the simple

$$H_1: \theta = \theta^* \neq \theta_0, \quad \psi = \psi^*,$$

and apply the result of 23.29 with $k = 2$, $\alpha_i$ as in (23.88), $\boldsymbol{\theta} = (\theta, \psi)$, $\boldsymbol{\theta}_1 = \boldsymbol{\theta}_2 = (\theta_0, \psi)$,
$\boldsymbol{\theta}^* = (\theta^*, \psi^*)$, $\boldsymbol{\theta}_1^0 = \boldsymbol{\theta}_2^0 = (\theta_0, \psi^0)$, $l_i(x) = s^{i-1}(x)$, $u_1 = u_2 = t$. We find that the
BCR $w$ for testing $H_0$ against $H_1$ is given by (23.77) and (23.73) as

$$\frac{C(\theta^*, \psi^*)\exp\{\theta^* s(x) + \sum_{i=1}^{r} \psi_i^* t_i(x)\}}{C(\theta_0, \psi^0)\exp\{\theta_0 s(x) + \sum_{i=1}^{r} \psi_i^0 t_i(x)\}} \geq m_1(t) + s(x)\,m_2(t). \tag{23.89}$$

(23.89) reduces to

$$\exp\{s(x)(\theta^* - \theta_0)\} \geq c_1(t, \theta^*, \theta_0, \psi^*, \psi^0) + s(x)\,c_2(t, \theta^*, \theta_0, \psi^*, \psi^0)$$

or

$$\exp\{s(x)(\theta^* - \theta_0)\} - s(x)\,c_2 \geq c_1. \tag{23.90}$$

(23.90) is equivalent to $s(x)$ lying outside an interval, i.e.

$$s(x) \leq v(t), \qquad s(x) \geq w(t), \tag{23.91}$$

where $v(t) < w(t)$ are possibly functions also of the parameters. We now show that
they are not dependent on the parameters other than $\theta_0$. As before, the sufficiency
of $t$ for $\psi$ rules out the dependence of $v$ and $w$ on $\psi$ when $t$ is given. That they do
not depend on $\theta^*$ follows at once from (23.86), which states that when $H_0$ holds

$$\int_w \{s(x) \mid t\}\,f\,dx = \alpha \int_W \{s(x) \mid t\}\,f\,dx. \tag{23.92}$$

The right-hand side of (23.92), which is integrated over the whole sample space $W$, clearly
does not depend on $\theta^*$ at all. Hence the left-hand side is also independent of $\theta^*$,
so that the BCR $w$ defined by (23.91) depends only on $\theta_0$, as it must. The BCR
therefore gives a UMP test of $H_0^{(2)}$ against $H_1^{(2)}$. Its unbiassedness follows by precisely the
argument at the end of 23.30. Thus, finally, we have established that the BCR defined
by (23.91) gives a UMPU test of $H_0^{(2)}$ against $H_1^{(2)}$. If we determine from the
conditional distribution of $s(x)$, given $t$, an interval which excludes $100\alpha$ per cent of the
distribution when $H_0^{(2)}$ holds, and take the excluded values as our critical region, then
if the region is unbiassed it gives the UMPU size-$\alpha$ test.


Finite-interval hypotheses 
23.32 We may also consider the hypothesis

$$H_0^{(3)}: \theta_0 \leq \theta \leq \theta_1$$

against

$$H_1^{(3)}: \theta < \theta_0 \ \text{ or } \ \theta > \theta_1,$$

or the complementary

$$H_0^{(4)}: \theta \leq \theta_0 \ \text{ or } \ \theta \geq \theta_1$$

against

$$H_1^{(4)}: \theta_0 < \theta < \theta_1.$$

We now set up two hypotheses

$$H_0': \theta = \theta_0, \ \psi = \psi^0, \qquad H_0'': \theta = \theta_1, \ \psi = \psi^1,$$

to be tested against

$$H_1: \theta = \theta^*, \ \psi = \psi^*, \quad \text{where } \theta_0 \neq \theta^* \neq \theta_1.$$

We use the result of 23.29 again, this time with $k = 2$, $\alpha_1 = \alpha_2 = \alpha$, $\boldsymbol{\theta} = (\theta, \psi)$,
$\boldsymbol{\theta}_1 = (\theta_0, \psi)$, $\boldsymbol{\theta}_2 = (\theta_1, \psi)$, $\boldsymbol{\theta}^* = (\theta^*, \psi^*)$, $\boldsymbol{\theta}_1^0 = (\theta_0, \psi^0)$, $\boldsymbol{\theta}_2^0 = (\theta_1, \psi^1)$, $l_i(x) \equiv 1$,
$u_1 = u_2 = t$. We find that the BCR $w$ for testing $H_0'$ or $H_0''$ against $H_1$ is defined by

$$f(x \mid \theta^*, \psi^*) \geq m_1(t)\,f(x \mid \theta_0, \psi^0) + m_2(t)\,f(x \mid \theta_1, \psi^1). \tag{23.93}$$

On substituting $f(x)$ from (23.73), (23.93) is equivalent to

$$H(s) = c_1\exp\{(\theta_0 - \theta^*)s(x)\} + c_2\exp\{(\theta_1 - \theta^*)s(x)\} \leq 1, \tag{23.94}$$

where $c_1$, $c_2$ may be functions of all the parameters and of $t$. If $\theta_0 < \theta^* < \theta_1$, (23.94)
requires that $s(x)$ lie inside an interval, i.e.

$$v(t) \leq s(x) \leq w(t). \tag{23.95}$$

On the other hand, if $\theta^* < \theta_0$ or $\theta^* > \theta_1$, (23.94) requires that $s(x)$ lie outside the
interval $(v(t), w(t))$. The proof that the end-points of the interval are not dependent
on the values of the parameters, other than $\theta_0$ and $\theta_1$, follows the same lines as before,
as does the proof of unbiassedness. Thus we have a UMPU test for $H_0^{(3)}$ and another
for $H_0^{(4)}$. The test is similar at values $\theta_0$ and $\theta_1$, as follows from 23.25. To obtain a
UMPU test for $H_0^{(3)}$ (or $H_0^{(4)}$), we determine an interval in the distribution of $s(x)$ for
given $t$ which excludes (for $H_0^{(4)}$ includes) $100\alpha$ per cent of the distribution both when
$\theta = \theta_0$ and $\theta = \theta_1$. The excluded (or included) region, if unbiassed, will give a UMPU
test of $H_0^{(3)}$ (or $H_0^{(4)}$).


23.33 We now turn to some applications of the fundamental results of 23.30-2 
concerning UMPU tests for the exponential family of distributions. We first mention 
briefly that in Example 23.11 and Exercises 22.1-3 above, UMPU tests for all four 
types of hypothesis are obtained directly from the distribution of the single sufficient 
statistic, no conditional distribution being involved since there is no nuisance parameter. 


Example 23.14 


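For $n$ independent observations from a normal distribution, the statistics $(\bar{x}, s^2)$ are
jointly sufficient for $(\mu, \sigma^2)$, with joint distribution (cf. Example 17.17)

$$g(\bar{x}, s^2 \mid \mu, \sigma^2) \propto \frac{s^{n-3}}{\sigma^n}\exp\left\{-\frac{\sum (x - \mu)^2}{2\sigma^2}\right\}. \tag{23.96}$$

(23.96) may be written

$$g \propto C(\mu, \sigma^2)\exp\left\{\left(-\tfrac{1}{2}\sum x^2\right)\left(\frac{1}{\sigma^2}\right) + \left(\sum x\right)\left(\frac{\mu}{\sigma^2}\right)\right\}, \tag{23.97}$$

which is of form (23.73). Remembering the discussion of 23.27, we now consider
a linear form in the parameters of (23.97). We put

$$\theta = A\left(\frac{1}{\sigma^2}\right) + B\left(\frac{\mu}{\sigma^2}\right), \tag{23.98}$$

where $A$ and $B$ are arbitrary known constants. We specialize $A$ and $B$ to obtain from
the results of 23.30-2 UMPU tests for the following hypotheses: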


(1) Put $A = 1$, $B = 0$ and test hypotheses concerning $\theta = 1/\sigma^2$ with $\psi = \mu/\sigma^2$ as
nuisance parameter. Here $s(x) = -\tfrac{1}{2}\sum x^2$ and $t(x) = \sum x$. From (23.97) there is
a UMPU test of $H_0^{(1)}$, $H_0^{(2)}$, $H_0^{(3)}$ and $H_0^{(4)}$ concerning $1/\sigma^2$, and hence concerning $\sigma^2$,
based on the conditional distribution of $\sum x^2$ given $\sum x$, i.e. of $\sum (x - \bar{x})^2$ given $\sum x$.
Since these two statistics are independently distributed, we may use the unconditional
distribution of $\sum (x - \bar{x})^2$, or of $\sum (x - \bar{x})^2/\sigma^2$, which is a $\chi^2$ distribution with $(n-1)$
degrees of freedom. $H_0^{(1)}$ was discussed in Examples 23.12-13, where the UMP
similar test was given for $\theta = \theta_0$ against one-sided alternatives and an unbiassed test
based on $\sum (x - \bar{x})^2$ given for $H_0^{(2)}$; it now follows that this is a UMPU test for $H_0^{(2)}$,
while the one-sided test is UMPU for $H_0^{(1)}$.

Graphs of the critical values of the UMPU tests of $H_0^{(3)}$ and $H_0^{(4)}$ for $\alpha = 0.05$, 0.10
are given by Guenther and Whitcomb (1966).


(2) To test hypotheses concerning $\mu$, invert (23.98) into $\mu = (\theta\sigma^2 - A)/B$.

If we specify a value $\mu_0$ for $\mu$, we cannot choose $A$ and $B$ to make this correspond
uniquely to a value $\theta_0$ for $\theta$ (without knowledge of $\sigma^2$) if $\theta_0 \neq 0$. But if $\theta_0 = 0$ we
have $\mu_0 = -A/B$. Thus from our UMPU tests for $H_0^{(1)}: \theta \leq 0$, $H_0^{(2)}: \theta = 0$, we get
UMPU tests of $\mu \leq \mu_0$ and of $\mu = \mu_0$. We use (23.71) to see that the test statistic
$s(x) \mid t$ is here $(-\tfrac{1}{2}\sum x^2)A + (\sum x)B$ given an orthogonal function, say $(-\tfrac{1}{2}\sum x^2)B - (\sum x)A$.
This reduces to the conditional distribution of $\sum x$ given $\sum x^2$. Clearly we cannot get
tests of $H_0^{(3)}$ or $H_0^{(4)}$ for $\mu$ in this case.

The test of $\mu = \mu_0$ against one-sided alternatives has been discussed in Example 23.7,
where we saw that the "Student's" $t$ test to which it reduces is the UMP similar test
of $\mu = \mu_0$ against one-sided alternatives. This test is now seen to be UMPU for $H_0^{(1)}$.
It also follows that the two-sided "equal-tails" "Student's" $t$-test, which is
unbiassed for $H_0^{(2)}$ against $H_1^{(2)}$, is the UMPU test of $H_0^{(2)}$.


Example 23.15 
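Consider $k$ independent samples of $n_i$ $(i = 1, 2, \ldots, k)$ observations from normal
distributions with means $\mu_i$ and common variance $\sigma^2$. Write $n = \sum_{i=1}^{k} n_i$. It is easily
confirmed that the $k$ sample means $\bar{x}_i$ and the pooled sum of squares
$S^2 = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2$
are jointly sufficient for the $(k+1)$ parameters. The joint distribution of the sufficient
statistics is

$$g(\bar{x}_1, \ldots, \bar{x}_k, S^2 \mid \mu_i, \sigma^2) \propto \frac{S^{n-k-2}}{\sigma^n}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i}\sum_{j} (x_{ij} - \mu_i)^2\right\}. \tag{23.99}$$

(23.99) is a simple generalization of (23.96), obtained by using the independence of
the $\bar{x}_i$ of each other and of $S^2$, and the fact that $S^2/\sigma^2$ has a $\chi^2$ distribution with $(n-k)$
degrees of freedom. (23.99) may be written

$$g \propto C(\mu_i, \sigma^2)\exp\left\{\left(-\tfrac{1}{2}\sum_{i}\sum_{j} x_{ij}^2\right)\left(\frac{1}{\sigma^2}\right) + \sum_{i}\left(\sum_{j} x_{ij}\right)\left(\frac{\mu_i}{\sigma^2}\right)\right\} \tag{23.100}$$

in the form (23.73). We now consider the linear function

$$\theta = A\left(\frac{1}{\sigma^2}\right) + \sum_{i=1}^{k} B_i\left(\frac{\mu_i}{\sigma^2}\right). \tag{23.101}$$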
(1) Put $A = 1$, $B_i = 0$ (all $i$). Then $\theta = 1/\sigma^2$ and $\psi_i = \mu_i/\sigma^2$ $(i = 1, \ldots, k)$ is the
set of nuisance parameters. There is a UMPU test of each of the four $H_0^{(i)}$ discussed
in 23.30-2 for $1/\sigma^2$, and therefore for $\sigma^2$. The tests are based on the conditional
distribution of $\sum_i \sum_j x_{ij}^2$ given the vector $(\sum_j x_{1j}, \sum_j x_{2j}, \ldots, \sum_j x_{kj})$, i.e. of
$S^2 = \sum_i \sum_j (x_{ij} - \bar{x}_i)^2$ given that vector. Just as in Example 23.14, this leads to the use
of the unconditional distribution of $S^2$ to obtain the UMPU tests.
(2) Exactly analogous considerations to those of Example 23.14 (2) show that by
putting $\theta_0 = 0$, we obtain UMPU tests of $\sum_{i=1}^{k} c_i\mu_i \leq c_0$, $\sum c_i\mu_i = c_0$, where $c_0$ is any
constant. (Cf. Exercise 23.19.) Just as before, no "interval" hypothesis can be
tested, using this method, concerning the linear form $\sum c_i\mu_i$.

(3) The substitution $k = 2$, $c_1 = 1$, $c_2 = -1$, $c_0 = 0$, reduces (2) to testing
$H_0^{(1)}: \mu_1 - \mu_2 \leq 0$, $H_0^{(2)}: \mu_1 - \mu_2 = 0$. The test of $\mu_1 - \mu_2 = 0$ has been discussed
in Example 23.8, where it was shown to reduce to a "Student's" $t$-test and to be
UMP similar. It is now seen to be UMPU for $H_0^{(1)}$. The "equal-tails" two-sided
"Student's" $t$-test, which is unbiassed, is also seen to be UMPU for $H_0^{(2)}$.


Example 23.16 
We generalize the situation in Example 23.15 by allowing the variances of the
$k$ normal distributions to differ. We now have a set of $2k$ sufficient statistics for the
$2k$ parameters, which are the sample sums and sums of squares
$\sum_{j=1}^{n_i} x_{ij}$, $\sum_{j=1}^{n_i} x_{ij}^2$, $i = 1, 2, \ldots, k$. We now write

$$\theta = \sum_{i=1}^{k} A_i\left(\frac{1}{\sigma_i^2}\right) + \sum_{i=1}^{k} B_i\left(\frac{\mu_i}{\sigma_i^2}\right). \tag{23.102}$$

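(1) Put $B_i = 0$ (all $i$). We get UMPU tests for all four hypotheses concerning

$$\theta = \sum_{i} A_i\left(\frac{1}{\sigma_i^2}\right),$$

a weighted sum of the reciprocals of the population variances. The case $k = 2$ reduces
this to

$$\theta = \frac{A_1}{\sigma_1^2} + \frac{A_2}{\sigma_2^2}.$$

If we want to test hypotheses concerning the variance ratio $\sigma_2^2/\sigma_1^2$, then just as in (2)
of Examples 23.14-15, we have to put $\theta = 0$ to make any progress. If we do this,
the UMPU tests of $\theta = 0$, $\leq 0$ reduce to those of

$$\frac{\sigma_2^2}{\sigma_1^2} = -\frac{A_2}{A_1}, \quad \leq -\frac{A_2}{A_1},$$

and we therefore have UMPU tests of $H_0^{(1)}$ and $H_0^{(2)}$ concerning the variance ratio.
The joint distribution of the four sufficient statistics may be written

$$g\left(\sum x_{1j}, \sum x_{2j}, \sum x_{1j}^2, \sum x_{2j}^2\right) \propto C(\mu_i, \sigma_i^2)\exp\left\{-\tfrac{1}{2}\left(\frac{1}{\sigma_1^2}\sum x_{1j}^2 + \frac{1}{\sigma_2^2}\sum x_{2j}^2\right) + \frac{\mu_1}{\sigma_1^2}\sum x_{1j} + \frac{\mu_2}{\sigma_2^2}\sum x_{2j}\right\}. \tag{23.103}$$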
By 23.27, the coefficient $s(x)$ of $\theta$ when (23.103) is transformed to make $\theta$ one of its
parameters, will be the same function of $-\tfrac{1}{2}\sum x_{1j}^2$, $-\tfrac{1}{2}\sum x_{2j}^2$ as $\theta$ is of $1/\sigma_1^2$, $1/\sigma_2^2$, i.e.

$$-2s(x) = A_1\sum x_{1j}^2 + A_2\sum x_{2j}^2,$$

and the UMPU tests will be based on the conditional distribution of $s(x)$ given any
three functions of the sufficient statistics, orthogonal to $s(x)$ and to each other, say

$$\sum x_{1j}, \quad \sum x_{2j} \quad \text{and} \quad A_2\sum x_{1j}^2 - A_1\sum x_{2j}^2.$$

This is equivalent to holding $\bar{x}_1$, $\bar{x}_2$ and
$t = \sum (x_{1j} - \bar{x}_1)^2 - \dfrac{A_1}{A_2}\sum (x_{2j} - \bar{x}_2)^2$ fixed, so
that $s(x)$ is equivalent to $\sum (x_{1j} - \bar{x}_1)^2 + \dfrac{A_2}{A_1}\sum (x_{2j} - \bar{x}_2)^2$ for fixed $t$. In turn, this is
equivalent to considering the distribution of the ratio $\sum (x_{1j} - \bar{x}_1)^2/\sum (x_{2j} - \bar{x}_2)^2$, so that
the UMPU tests of $H_0^{(1)}$, $H_0^{(2)}$ are based on the distribution of the sample variance ratio
(cf. Exercises 23.14 and 23.17).

(2) We cannot get UMPU tests concerning functions of the $\mu_i$ free of the $\sigma_i^2$, as is
obvious from (23.102). In the case $k = 2$, this precludes us from finding a solution
to the problem of two means by this method.


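23.34 The results of 23.27-33 may be better appreciated with the help of a partly
geometrical explanation. From (23.73), the characteristic function of $s(x)$ is

$$\phi(u) = E\{\exp(ius)\} = \frac{C(\theta)}{C(\theta+iu)}, \tag{23.103}$$
so that its cumulant-generating function is
$$\psi(u) = \log\phi(u) = \log C(\theta) - \log C(\theta+iu). \tag{23.104}$$
From the form of (23.104), it is clear that the $r$th cumulant of $s(x)$ is
$$\kappa_r = \left[\frac{\partial^r}{\partial(iu)^r}\psi(u)\right]_{u=0} = -\frac{\partial^r}{\partial\theta^r}\log C(\theta), \tag{23.105}$$
whence
$$\kappa_1 = E(s) = -\frac{\partial}{\partial\theta}\log C(\theta) \tag{23.106}$$
and
$$\kappa_r = \frac{\partial^{r-1}}{\partial\theta^{r-1}} E(s), \qquad r \ge 2. \tag{23.107}$$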
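
As a numerical illustration of (23.105)-(23.107), the following Python sketch takes the Poisson distribution written in the exponential form with no nuisance parameters, for which $s(x) = x$ and $C(\theta) = \exp(-e^{\theta})$, the Poisson mean being $e^{\theta}$. The choice of distribution, the function names and the finite-difference step are assumptions made purely for illustration. The sketch differentiates $-\log C(\theta)$ numerically and compares the results with the mean and variance summed directly from the frequency function.

```python
import math

def neg_log_C(theta):
    """-log C(theta) for the Poisson family C(theta) * (1/x!) * exp(theta * x)."""
    # Assumed illustration: C(theta) = exp(-exp(theta)), so -log C(theta) = exp(theta).
    return math.exp(theta)

def first_two_cumulants_from_C(theta, h=1e-3):
    """kappa_1 and kappa_2 as theta-derivatives of -log C (central differences)."""
    f = neg_log_C
    k1 = (f(theta + h) - f(theta - h)) / (2 * h)
    k2 = (f(theta + h) - 2 * f(theta) + f(theta - h)) / h ** 2
    return k1, k2

def first_two_cumulants_direct(theta, x_max=200):
    """Mean and variance of s = x, summed from the Poisson frequency function."""
    m = math.exp(theta)
    p = math.exp(-m)              # P(x = 0)
    mean = second = 0.0
    for x in range(x_max + 1):
        mean += x * p
        second += x * x * p
        p *= m / (x + 1)          # P(x + 1) obtained from P(x)
    return mean, second - mean ** 2

theta = math.log(2.5)
print(first_two_cumulants_from_C(theta))
print(first_two_cumulants_direct(theta))
```

Both lines should agree closely, each cumulant being the Poisson mean $e^{\theta}$.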


Consider the derivative
$$D^q f = \frac{\partial^q}{\partial\theta^q} f(x\,|\,\theta,\psi).$$
From (23.73) and (23.106),
$$Df = \left\{s + \frac{C'(\theta)}{C(\theta)}\right\} f = \{s - E(s)\} f. \tag{23.108}$$
By Leibniz's rule, we have from (23.108)
$$D^q f = D^{q-1}[\{s-E(s)\}f] = \{s-E(s)\}D^{q-1}f + \sum_{i=1}^{q-1}\binom{q-1}{i}[D^i\{s-E(s)\}][D^{q-1-i}f], \tag{23.109}$$
which, using (23.107), may be written
$$D^q f = \{s-E(s)\}D^{q-1}f - \sum_{i=1}^{q-1}\binom{q-1}{i}\kappa_{i+1}D^{q-1-i}f. \tag{23.110}$$


23.35 Now consider any critical region $w$ of size $\alpha$. Its power function is defined
at (23.80), and we may alternatively express this as an integral in the sample space of
the sufficient statistics $(s, t)$ by
$$P(w\,|\,\theta) = \int_w f\, ds\, dt, \tag{23.111}$$
where $f$ now stands for the joint frequency function of $(s, t)$, which is of the form (23.73)
as we have seen. The derivatives of the power function (23.111) are
$$P^{(q)}(w\,|\,\theta) = \int_w D^q f\, ds\, dt, \tag{23.112}$$
since we may differentiate under the integral sign in (23.111). Using (23.108) and
(23.110), (23.112) gives
$$P'(w\,|\,\theta) = \int_w \{s-E(s)\} f\, ds\, dt = \operatorname{cov}\{s, c(w)\}, \tag{23.113}$$
and
$$P^{(q)}(w\,|\,\theta) = \int_w \{s-E(s)\} D^{q-1} f\, ds\, dt - \sum_{i=1}^{q-1}\binom{q-1}{i}\kappa_{i+1} P^{(q-1-i)}(w\,|\,\theta), \qquad q \ge 2, \tag{23.114}$$
a recurrence relation which enables us to build up the value of any derivative from
lower derivatives. In particular, (23.114) gives
$$P''(w\,|\,\theta) = \operatorname{cov}\{[s-E(s)]^2, c(w)\}, \tag{23.115}$$
$$\left.\begin{aligned} P'''(w\,|\,\theta) &= \operatorname{cov}\{[s-E(s)]^3, c(w)\} - 3\kappa_2 P'(w\,|\,\theta), \\ P^{(4)}(w\,|\,\theta) &= \operatorname{cov}\{[s-E(s)]^4, c(w)\} - 6\kappa_2 P''(w\,|\,\theta) - 4\kappa_3 P'(w\,|\,\theta). \end{aligned}\right\} \tag{23.116}$$
(23.113) and (23.115) show that the first two derivatives are simply the covariances
of $c(w)$ with $s$, and with the squared deviation of $s$ from its mean, respectively. The
third and fourth derivatives given by (23.116) are more complicated functions of
covariances and of the cumulants of $s$, as are the higher derivatives.
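
The covariance expressions (23.113) and (23.115) can be checked numerically in a simple special case. The Python sketch below assumes a single observation $x$ from a normal distribution with mean $\theta$ and unit variance, so that $s = x$ and there is no $t$, and takes the one-sided region $w = \{x > c\}$; these choices, together with the function names, the step lengths and the values $\theta = 0.3$, $c = 1.645$, are illustrative assumptions only. It compares finite-difference derivatives of the power function with the covariances of $c(w)$ (taken as 1 on $w$ and 0 elsewhere) with $s$ and with $\{s - E(s)\}^2$.

```python
import math

def phi(u):
    """Standard normal frequency function."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def power(theta, c):
    """P(w | theta) for w = {x > c} when x is N(theta, 1)."""
    return 0.5 * math.erfc((c - theta) / math.sqrt(2))

def cov_with_region(g, theta, c, upper=12.0, steps=20000):
    """cov{g(s - E(s)), c(w)} by the midpoint rule in the deviation u = x - theta."""
    def integral(lower):
        h = (upper - lower) / steps
        return h * sum(g(lower + (i + 0.5) * h) * phi(lower + (i + 0.5) * h)
                       for i in range(steps))
    # E[g * c(w)] - E[g] * E[c(w)]
    return integral(c - theta) - integral(-upper) * power(theta, c)

def derivatives(theta, c, h=1e-3):
    """First and second derivatives of the power function (central differences)."""
    p_minus, p0, p_plus = power(theta - h, c), power(theta, c), power(theta + h, c)
    return (p_plus - p_minus) / (2 * h), (p_plus - 2 * p0 + p_minus) / h ** 2

theta, c = 0.3, 1.645
d1, d2 = derivatives(theta, c)
print(d1, cov_with_region(lambda u: u, theta, c))        # compare with (23.113)
print(d2, cov_with_region(lambda u: u * u, theta, c))    # compare with (23.115)
```

In each printed pair the two numbers should agree to several decimal places.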


23.36 We are now in a position to interpret geometrically some of the results of
23.27-33. To maximize the power we must choose $w$ to maximize, for all admissible
alternatives, the covariance of $c(w)$ with $s$, or some function of $s - E(s)$, in accordance
with (23.113) and (23.114). In the $(r+1)$-dimensional space of the sufficient statistics,
$(s, t)$, it is obvious that this will be done by confining ourselves to the subspace
orthogonal to the $r$ co-ordinates corresponding to the components of $t$, i.e. by confining
ourselves to the conditional distribution of $s$ given $t$.

If we are testing $\theta = \theta_0$ against $\theta > \theta_0$ we maximize $P(w\,|\,\theta)$ for all $\theta > \theta_0$ by
maximizing $P'(w\,|\,\theta)$, i.e. by maximizing $\operatorname{cov}(s, c(w))$ for all $\theta > \theta_0$. This is easily
seen to be done if $w$ consists of the $100\alpha$ per cent largest values of the distribution of
$s$ given $t$. Similarly for testing $\theta = \theta_0$ against $\theta < \theta_0$, we maximize $P$ by minimizing $P'$,
and this is done if $w$ consists of the $100\alpha$ per cent smallest values of the distribution of
$s$ given $t$. Since $P'(w\,|\,\theta)$ is always of the same sign, the one-sided tests are unbiassed.
For the two-sided $H_0^{(4)}$ of 23.31, (23.81) and (23.115) require us to maximize
$P''(w\,|\,\theta)$, i.e. $\operatorname{cov}\{[s-E(s)]^2, c(w)\}$. By exactly the same argument as in the one-sided
case, we choose $w$ to include the $100\alpha$ per cent largest values of $\{s-E(s)\}^2$, so that
we obtain a two-sided test, which is only an "equal-tails" test if the distribution of $s$
given $t$ is symmetrical. It follows that the boundaries of the UMPU critical region
are equidistant from $E(s\,|\,t)$.


Ancillary statistics: a conditionality principle 


23.37 We have seen that there is always a set of $r+s$ ($r \ge 1$, $s \ge 0$) statistics,
written $(T_r, T_s)$, which are minimal sufficient for $k+l$ ($k \ge 1$, $l \ge 0$) parameters,
which we shall write $(\theta_k, \theta_l)$. Suppose now that the subset $T_s$ has a distribution free
of $\theta_k$. (This is only possible if the distribution of $(T_r, T_s)$ is not complete; cf. Exercise
23.7.) We then have the factorization of the Likelihood Function into

$$L(x\,|\,\theta_k, \theta_l) = g(T_r, T_s\,|\,\theta_k, \theta_l)\, h(x) = g_1(T_r\,|\,T_s, \theta_k, \theta_l)\, g_2(T_s\,|\,\theta_l)\, h(x). \tag{23.117}$$
This is (21.95) in different notation.

Fisher (e.g., 1956) calls $T_s$ an ancillary statistic, while Bartlett (e.g., 1939) calls the
conditional statistic $(T_r\,|\,T_s)$ a quasi-sufficient statistic for $\theta_k$, the term arising from the
resemblance of (23.117) when $\theta_l$ is known to the factorization (17.84) which characterizes
a sufficient statistic.

Fisher has suggested a Conditionality Principle for statistical inference in general,
and testing hypotheses in particular: if $T_s$ is distributed free of $\theta_k$ as in (23.117), the
conditional distribution of $T_r\,|\,T_s$ is all that we need to consider in making inferences
about $\theta_k$. Now if $T_s$ is sufficient for $\theta_l$ when $\theta_k$ is known, it immediately follows that
(23.117) becomes

$$L(x\,|\,\theta_k, \theta_l) = g_1(T_r\,|\,T_s, \theta_k)\, g_2(T_s\,|\,\theta_l)\, h(x), \tag{23.118}$$

and the two distributions of $(T_r\,|\,T_s)$ and $T_s$ are separated off, each depending on a
separate parameter and each sufficient for its parameter. There is then no doubt that,
in accordance with the general principle of 23.3, we may confine ourselves to functions
of $(T_r\,|\,T_s)$ in testing $\theta_k$.

However, the real question is whether we should confine ourselves to the conditional
statistic when $T_s$ is not sufficient for $\theta_l$, for Welch (1939) gave an example of a simple
hypothesis concerning the mean of a rectangular distribution with known range which
showed that the conditional test based on $(T_r\,|\,T_s)$ may be uniformly less powerful
than an alternative (unconditional) test.

This question has far-reaching implications, since A. Birnbaum (1962) has shown 
that the Conditionality Principle implies (as well as being obviously implied by) the 
Likelihood Principle, which states (cf. 18.32) that only the LF need be regarded in 
making any statistical inference from observations. In particular, this has the con- 
sequence that the details of the sampling procedure which produced the observations 
(and the LF) are strictly irrelevant to subsequent statistical inference. 

Many, perhaps most, statisticians will find it intuitively unacceptable to eliminate 
the sample space from consideration in making inferences from observations. If 
so, they must reject the Likelihood Principle, and the Conditionality Principle must 
automatically go with it. 
Example 23.17

We have seen (Example 17.17) that in normal samples the pair $(\bar{x}, s^2)$ is jointly
sufficient for $(\mu, \sigma^2)$, and we know that the distribution of $s^2$ does not depend on $\mu$.
Thus we have
$$L(x\,|\,\mu, \sigma^2) = g_1[(\bar{x}\,|\,s^2)\,|\,\mu, \sigma^2]\; g_2(s^2\,|\,\sigma^2)\; h(x),$$
a case of (23.117) with $k = l = r = s = 1$. The ancillary principle states that the
conditional statistic $\bar{x}\,|\,s^2$ is to be used in testing hypotheses about $\mu$. (It happens that
$\bar{x}$ is actually independent of $s^2$ in this case, but this is merely a simplification irrelevant
to the general argument.) But $s^2$ is not a sufficient statistic for the nuisance parameter
$\sigma^2$, so that the distribution of $\bar{x}\,|\,s^2$ is not free of $\sigma^2$. If we have no prior distribution
given for $\sigma^2$ we can only make progress by integrating out $\sigma^2$ in some more or less
arbitrary way. If we are prepared to use its fiducial distribution and integrate over
that, we arrive back at the discussion of 21.10, where we found that this gives the same
result as that obtained from the standpoint of maximizing power in Examples 23.7
and 23.14, namely that "Student's" $t$-distribution should be used.


Another conditional test principle 

23.38 Another principle of test construction may be invoked (cf. D. R. Cox
(1958a)) to suggest the use of $(T_r\,|\,T_s)$ whenever $T_s$ is sufficient for $\theta_l$ when $\theta_k$ is
known, irrespective of whether its distribution depends on $\theta_k$, for then we have

$$L(x\,|\,\theta_k, \theta_l) = g_1(T_r\,|\,T_s, \theta_k)\, g_2(T_s\,|\,\theta_k, \theta_l)\, h(x), \tag{23.119}$$
so that the conditional statistic is distributed independently of the nuisance parameter
$\theta_l$. Here again, we have no obvious reason to suppose that the test is optimum in
any sense. (23.119) is (21.96) in different notation.


The justification of conditional tests 

23.39 The results of 23.30-2 enable us to see that, if the distribution of the
sufficient statistics $(T_r, T_s)$ is of the exponential form (23.73), then the use of the
conditional distribution of $T_r$ for given $T_s$ will give UMPU tests, for in our previous
notation the statistic $T_r$ is $s(x)$ and $T_s$ is $t(x)$, and we have seen that the UMPU tests
are always based on the distribution of $T_r$ for given $T_s$. If the sufficient statistics are
not distributed in the form (23.73) (e.g. in the case of a distribution with range
depending on the parameters) this justification is no longer valid. However, following
Lindley (1958b), we may derive a further justification of the conditional statistic $T_r\,|\,T_s$
provided only that the distribution of $T_s$, $g_2(T_s\,|\,\theta_k, \theta_l)$, is boundedly complete when
$H_0$ holds and that $T_s$ is then sufficient for $\theta_l$. For then, by 23.19, every size-$\alpha$ critical
region similar with respect to $\theta_l$ will consist of a fraction $\alpha$ of all surfaces of constant
$T_s$. Thus any similar test of $H_0$ will be a conditional test based on $T_r\,|\,T_s$, and any
optimum conditional test will be an optimum similar test.

Welch's (1939) counter-example, which is given in Exercise 23.23, falls within the
scope of neither of our justifications of the use of conditional test statistics, for there
the two-component minimal sufficient statistic for the single parameter is not complete,
so that Exercise 23.7 does not preclude the existence of an ancillary statistic (the sample
range $R$).
EXERCISES


23.1 Show that for samples of $n$ observations from a normal distribution with mean
$\theta$ and variance $\sigma^2$, no symmetric similar region with respect to $\theta$ and $\sigma^2$ exists for $n \le 2$,
but that such regions do exist for $n \ge 3$.
(Feller, 1938) 


23.2 Show, as in Example 23.1, that for a sample of n observations, the ith of which 
has distribution 


1 
——-e~% xt —1 dy, , Gx = cos 0; 0, 


no similar size-« region exists for 0 < « < 1. 


(Feller, 1938) 


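23.3 If $L(x\,|\,\theta)$ is a Likelihood Function and $E\left(\dfrac{\partial \log L}{\partial\theta}\right) = 0$, show that if the
distribution of a statistic $z$ does not depend on $\theta$ then $\operatorname{cov}\left(z, \dfrac{\partial \log L}{\partial\theta}\right) = 0$. As a
corollary, show that no similar region with respect to $\theta$ exists if no statistic exists which is
uncorrelated with $\dfrac{\partial \log L}{\partial\theta}$.
(Neyman, 1938a)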


23.4 Show, using the c.f. of $z$, that the converses of the result and the corollary of
Exercise 23.3 are true.

Together, this exercise and the last state that $\operatorname{cov}\left(z, \dfrac{\partial \log L}{\partial\theta}\right) = 0$ is a necessary
and sufficient condition for $\operatorname{cov}\left(e^{iuz}, \dfrac{\partial \log L}{\partial\theta}\right) = 0$, where $u$ is a dummy variable.
(Neyman, 1938a)


23.5 Show that the Cauchy family of distributions 


dx 
dF = ——_——_-— —~oO<x< Oo, 


cc) 


is not complete. 


(Lehmann and Scheffé, 1950) 


23.6 Show that if a statistic $z$ is distributed independently of $t$, a sufficient statistic
for $\theta$, then the distribution of $z$ does not depend on $\theta$.


23.7 In Exercise 23.6, write $H_1(z)$ for the d.f. of $z$, $H_2(z\,|\,t)$ for its conditional d.f.
given $t$, and $g(t\,|\,\theta)$ for the frequency function of $t$. Show that

$$\int \{H_1(z) - H_2(z\,|\,t)\}\, g(t\,|\,\theta)\, dt = 0$$
for all $\theta$. Hence show that if $t$ is a complete sufficient statistic for $\theta$, the converse of the
result of Exercise 23.6 holds, namely, if the distribution of $z$ does not depend upon $\theta$,
$z$ is distributed independently of $t$.
(Basu, 1955)


23.8 Use the result of Exercise 23.7 to show directly that, in univariate normal
samples:
(a) any moment about the sample mean $\bar{x}$ is distributed independently of $\bar{x}$;
(b) the quadratic form $x'Ax$ is distributed independently of $\bar{x}$ if and only if the
elements of each row of the matrix $A$ add to zero (cf. 15.15);
(c) the sample range is distributed independently of $\bar{x}$;
(d) $(x_{(n)} - \bar{x})/(x_{(n)} - x_{(1)})$ is distributed independently both of $\bar{x}$ and of $s^2$, the sample
variance.
(Hogg and Craig, 1956) 


23.9 Use Exercise 23.7 to show that: 

(a) in samples from a bivariate normal distribution with $\rho = 0$, the sample correlation
coefficient is distributed independently of the sample means and variances (cf.
16.28);

(b) in independent samples from two univariate normal populations with the same
variance $\sigma^2$, the statistic

$$F = \frac{\sum_j (x_{1j} - \bar{x}_1)^2/(n_1 - 1)}{\sum_j (x_{2j} - \bar{x}_2)^2/(n_2 - 1)}$$

is distributed independently of the set of three jointly sufficient statistics
$$\bar{x}_1, \quad \bar{x}_2, \quad \sum_j (x_{1j} - \bar{x}_1)^2 + \sum_j (x_{2j} - \bar{x}_2)^2$$

and therefore of the statistic 


2 = ————__. 
DX (% 15 — ¥3)? + Ui (x25 — Kg)? Ny + Ne 


which is a function of the sufficient statistics. This holds whether or not the popu- 
lation means are equal. 


(4-7 aco | 


(Hogg and Craig, 1956) 
23.10 In samples of size $n$ from the distribution
$$dF = \exp\{-(x-\theta)\}\, dx, \qquad \theta \le x \le \infty,$$
show that $x_{(1)}$ is distributed independently of
$$z = \sum_{i=1}^{r} (x_{(i)} - x_{(1)}) + (n-r)(x_{(r)} - x_{(1)}), \qquad r \le n.$$
(Epstein and Sobel, 1954) 


23.11 Show that for the binomial distribution with parameter $\pi$, the sample proportion
$p$ is minimal sufficient for $\pi$.
(Lehmann and Scheffé, 1950) 


23.12 For the rectangular distribution
$$dF = dx, \qquad \theta - \tfrac{1}{2} \le x \le \theta + \tfrac{1}{2},$$
show that the pair of statistics $(x_{(1)}, x_{(n)})$ is minimal sufficient for $\theta$.


(Lehmann and Scheffé, 1950) 
23.13 For a normal distribution with variance $\sigma^2$ and unspecified mean $\mu$, show by
the method of 23.20 that the UMP similar test of $H_0: \sigma^2 = \sigma_0^2$ against $H_1: \sigma^2 = \sigma_1^2$ takes
the form

$$\sum (x-\bar{x})^2 \ge a_\alpha \quad \text{if } \sigma_1^2 > \sigma_0^2,$$
$$\sum (x-\bar{x})^2 \le b_\alpha \quad \text{if } \sigma_1^2 < \sigma_0^2.$$


23.14 Two normal distributions have unspecified means and variances $\sigma^2$, $\theta\sigma^2$. From
independent samples of sizes $n_1$, $n_2$, show by the method of 23.20 that the UMP similar
test of $H_0: \theta = 1$ against $H_1: \theta = \theta_1$ takes the form

fina, “2 6, > 1, 
Se Gb, cat Beit, 
where $s_1^2$, $s_2^2$ are the sample variances.


(Harter (1963) shows that a test based on the ratio of sample 
ranges is almost as powerful as the UMP similar test.) 


23.15 Independent samples, each of size $n$, are taken from the distributions
$$dF = \exp(-x/\theta_1)\, dx/\theta_1, \qquad \theta_1, \theta_2 > 0,$$
$$dG = \exp(-y\theta_2)\, \theta_2\, dy, \qquad 0 \le x, y \le \infty.$$

Show that $t = (\sum x, \sum y) = (X, Y)$ is minimal sufficient for $(\theta_1, \theta_2)$ and remains so if
$H_0: \theta_1 = \theta_2 = \theta$ holds. By considering the function $XY - E(XY)$ show that the
distribution of $t$ is not boundedly complete given $H_0$, so that not all similar regions satisfy
(23.13). Finally, show that the statistic $XY$ is then distributed independently of $\theta$, so
that $H_0$ may be tested by similar regions from the distribution of $XY$.
(Watson, 1957a) 


23.16 In Example 23.14, show from (23.98) that there is a UMPU test of the 
hypothesis that the parameter point (4, 0) lies between the two parabolas 


= Mo + C10", b= Mot, 0", 


tangent to each other at (M, 0). 
(Lehmann and Scheffé, 1955) 


23.17 In Exercise 23.14, show that the critical region 
si/s3 > Ara, < dry, 


is biassed against the two-sided alternative $H_1: \theta \ne 1$ unless $n_1 = n_2$. By exactly the
same argument as in Example 23.12, show that an unbiassed critical region


‘= st /s) 2 Q— a) < Bess hj, + % = &, 
is determined by the condition (cf. (23.56) ) 
Vif Va,) sd Vea J V2); = 


where $f$ is the frequency function of the variance-ratio statistic $t$ and $V_\alpha$ its $100\alpha$ per cent
point. Show that the power function of the unbiassed test is monotone increasing for
$\theta > 1$, monotone decreasing for $\theta < 1$.


(Ramachandran (1958) gives values of Vi_,,, Vz, for « = 0°05, 
n,—1 and m.—1 = 2(1)4(2)12(4)24; 30, 40, 60) 
23.18 In Exercise 23.17, show that the unbiassed confidence interval for $\theta$ given by
& oo minimizes the expectation of (log U—log L) for confidence intervals (L, U) 
based on the tails of the distribution of t. 

(Scheffé, 1942b) 

23.19 In Example 23.15, use 23.27 to show that the UMPU tests for $\sum c_i \mu_i$ are
based on the distribution of


2 2\2 
i= Bete—ay /{ 2st 5) 


which is a “ Student’s”’ t with (n—k) degrees of freedom. 


23.20 In Example 23.16, show that there is a UMPU test of the hypothesis 


be o% Paver 
Fae tF YJ, by #0. 


23.21 For independent samples from two Poisson distributions with parameters
$\mu_1$, $\mu_2$, show that there are UMPU tests for all four hypotheses considered in 23.30-2
concerning $\mu_1/\mu_2$, and that the test of $\mu_1/\mu_2 = 1$ consists of testing whether the sum of
the observations is binomially distributed between the samples with equal probabilities.

(Lehmann and Scheffé, 1955) 


23.22 For independent binomial distributions with parameters $\theta_1$, $\theta_2$, find the UMPU
tests for all four hypotheses in 23.30-2 concerning the "odds ratio"
$$\left(\frac{\theta_1}{1-\theta_1}\right) \bigg/ \left(\frac{\theta_2}{1-\theta_2}\right),$$
and the UMPU tests for $\theta_1 = \theta_2$, $\theta_1 \le \theta_2$.
(Lehmann and Scheffé, 1955) 


23.23 For the rectangular distribution
$$dF = dx, \qquad \theta - \tfrac{1}{2} \le x \le \theta + \tfrac{1}{2},$$

the conditional distribution of the midrange $M$ given the range $R$, and the marginal
distribution of $M$, are given by the results of Exercises 14.12, 14.13 and 14.16. For
testing $H_0: \theta = \theta_0$ against the two-sided alternative $H_1: \theta \ne \theta_0$ show that the
"equal-tails" test based on $M$ given $R$, when integrated over all values of $R$, gives uniformly less
power than the "equal-tails" test based on the marginal distribution of $M$; use the
value $\alpha = 0.08$ for convenience.


(Welch, 1939) 


23.24 In Example 23.9, show that the UMPU test of Hy: o = 6, against H,:0 # 0 
is of the form 


(Lehmann, 1947) 


23.25 For the distribution of Example 23.9, show that the UMP similar test of
$H_0: \theta = \theta_0$ against $H_1: \theta \ne \theta_0$ is of the form


—6 
— < 0, = Cy 


(Lehmann, 1947) 
23.26 For the rectangular distribution
$$dF = dx/\theta, \qquad \mu \le x \le \mu + \theta,$$
show that the UMP similar test of $H_0: \mu = \mu_0$ against $H_1: \mu \ne \mu_0$ is of the form
x 
2 SF ee ee 
X(n) — (1) 
Cf. the simple hypothesis with 0 = 1, where it was seen in Exercise 22.8 that no UMP 
test exists. 


(Lehmann, 1947) 
23.27 If $x_1, \ldots, x_n$ are independent observations from the distribution

$$dF = \frac{1}{\Gamma(p)\,\theta^p} \exp(-x/\theta)\, x^{p-1}\, dx, \qquad p > 0, \; 0 \le x \le \infty,$$

use Exercises 23.6 and 23.7 to show that a necessary and sufficient condition that a statistic
$h(x_1, \ldots, x_n)$ be independent of $S = \sum x_i$ is that $h(x_1, \ldots, x_n)$ be homogeneous of
degree zero in $x$. (Cf. refs. to Exercise 15.22.)


23.28 From (23.113) and (23.114), show that if the first non-zero derivative of the
power function is the $m$th, then

$$P^{(m)}(w\,|\,\theta) = \operatorname{cov}\{[s-E(s)]^m, c(w)\}$$
and
$$\frac{\{P^{(m)}(w\,|\,\theta)\}^2}{\mu_{2m}} \le \frac{1}{4},$$
where $\mu_r$ is the $r$th central moment of $s$. In particular,

$$|P'(w\,|\,\theta)| \le \tfrac{1}{2}\mu_2^{1/2}.$$


23.29 From 23.35, show that $w$ is a similar region for a hypothesis for which $\theta$ is a
nuisance parameter if and only if

$$\operatorname{cov}\{s, c(w)\} = 0$$
identically in $\theta$. Cf. Exercises 23.3-4.


23.30 Generalize the argument of the last paragraph of Example 23.7 to show that
for any distribution of form
$$dF = f\left(\frac{x-\mu}{\sigma}\right) \frac{dx}{\sigma}$$
admitting a complete sufficient statistic for $\sigma$ when $\mu$ is known, there can be no similar
critical region for $H_0: \mu = \mu_0$ against $H_1: \mu = \mu_1$ with power independent of $\sigma$.


23.31 For a normal distribution with mean and variance both equal to $\theta$, show
that for a single observation, $x$ and $x^2$ are both sufficient for $\theta$, $x^2$ being minimal. Hence
it follows that single sufficiency does not imply minimal sufficiency. (Exercise 18.13
gives a bivariate instance of the same phenomenon.)


CHAPTER 24 


LIKELIHOOD RATIO TESTS AND THE 
GENERAL LINEAR HYPOTHESIS 


24.1 The ML method discussed in Chapter 18 is a constructive method of obtain- 
ing estimators which, under certain conditions, have desirable properties. A method 
of test construction closely allied to it is the Likelihood Ratio (LR) method, proposed 
by Neyman and Pearson (1928). It has played a similar role in the theory of tests 
to that of the ML method in the theory of estimation. 

As before, we have a LF

$$L(x\,|\,\theta) = \prod_{i=1}^{n} f(x_i\,|\,\theta),$$

where $\theta = (\theta_r, \theta_s)$ is a vector of $r+s = k$ parameters ($r \ge 1$, $s \ge 0$) and $x$ may also
be a vector. We wish to test the hypothesis
$$H_0: \theta_r = \theta_{r0}, \tag{24.1}$$
which is composite unless $s = 0$, against
$$H_1: \theta_r \ne \theta_{r0}.$$

We know that there is generally no UMP test in this situation, but that there may
be a UMPU test (cf. 23.31).

The LR method first requires us to find the ML estimators of $(\theta_r, \theta_s)$, giving the
unconditional maximum of the LF
$$L(x\,|\,\hat\theta_r, \hat\theta_s), \tag{24.2}$$
and also to find the ML estimators of $\theta_s$, when $H_0$ holds,(*) giving the conditional
maximum of the LF
$$L(x\,|\,\theta_{r0}, \hat{\hat\theta}_s). \tag{24.3}$$
$\hat{\hat\theta}_s$ in (24.3) has been given a double circumflex to emphasize that it does not in general
coincide with $\hat\theta_s$ in (24.2). Now consider the likelihood ratio(†)
$$l = \frac{L(x\,|\,\theta_{r0}, \hat{\hat\theta}_s)}{L(x\,|\,\hat\theta_r, \hat\theta_s)}. \tag{24.4}$$
Since (24.4) is the ratio of a conditional maximum of the LF to its unconditional
maximum, we clearly have
$$0 \le l \le 1. \tag{24.5}$$


Intuitively, $l$ is a reasonable test statistic for $H_0$: it is the maximum likelihood under
$H_0$ as a fraction of its largest possible value, and large values of $l$ signify that $H_0$ is
reasonably acceptable. The critical region for the test statistic is therefore
$$l \le c_\alpha, \tag{24.6}$$
where $c_\alpha$ is determined from the distribution $g(l)$ of $l$ to give a size-$\alpha$ test, i.e.
$$\int_0^{c_\alpha} g(l)\, dl = \alpha. \tag{24.7}$$

(*) When $s = 0$, $H_0$ being simple, no maximization process is needed, for $L$ is uniquely
determined.

(†) The ratio is usually denoted by $\lambda$, and the LR statistic is sometimes called "the lambda
criterion," but we use the Roman letter in accordance with the convention that Greek symbols
are reserved for parameters.


24.2 For the LR method to be useful in the construction of similar tests, i.e. tests
based on similar critical regions, the distribution of $l$ should be free of nuisance
parameters, and it is a fact that for most statistical problems it is so. The next two examples
illustrate the method in cases where it does and does not lead to a similar test.


Example 24.1 
For the normal distribution
$$dF(x) = (2\pi\sigma^2)^{-1/2} \exp\left\{-\tfrac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right\} dx,$$
we wish to test
$$H_0: \mu = \mu_0.$$
Here
$$L(x\,|\,\mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left\{-\tfrac{1}{2}\sum\left(\frac{x-\mu}{\sigma}\right)^2\right\}.$$

Using Example 18.11, we have for the unconditional ML estimators
$$\hat\mu = \bar{x}, \qquad \hat\sigma^2 = \frac{1}{n}\sum (x-\bar{x})^2 = s^2,$$
so that
$$L(x\,|\,\hat\mu, \hat\sigma^2) = (2\pi s^2)^{-n/2} \exp(-\tfrac{1}{2}n). \tag{24.8}$$

When $H_0$ holds, the ML estimator is (cf. Example 18.8)
$$\hat{\hat\sigma}^2 = \frac{1}{n}\sum (x-\mu_0)^2 = s^2 + (\bar{x}-\mu_0)^2,$$
so that
$$L(x\,|\,\mu_0, \hat{\hat\sigma}^2) = [2\pi\{s^2 + (\bar{x}-\mu_0)^2\}]^{-n/2} \exp(-\tfrac{1}{2}n). \tag{24.9}$$
From (24.4), (24.8) and (24.9), we find
$$l^{2/n} = \frac{s^2}{s^2 + (\bar{x}-\mu_0)^2},$$
or
$$l^{2/n} = \frac{1}{1 + t^2/(n-1)},$$
where $t$ is "Student's" $t$-statistic with $(n-1)$ degrees of freedom. Thus $l$ is a monotone
decreasing function of $t^2$. Hence we may use the known exact distribution of $t^2$
as equivalent to that of $l$, rejecting the $100\alpha$ per cent largest values of $t^2$, which
correspond to the $100\alpha$ per cent smallest values of $l$. We thus obtain an "equal-tails" test
based on the distribution of "Student's" $t$, half of the critical region consisting of
extreme positive values, and half of extreme negative values, of $t$. This is a very
reasonable test: we have seen that it is UMPU for $H_0$ in Example 23.14.
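
The reduction of $l$ to a function of $t^2$ may be verified numerically. The following Python sketch (function names and sample values are illustrative assumptions, not part of the text) computes $l$ once directly as the ratio of the two maximized likelihoods and once from "Student's" $t$, with $s^2$ taken with divisor $n$ as in the Example.

```python
import math

def lr_direct(xs, mu0):
    """l as the ratio of the maximized LF under H0 to its unconditional maximum."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / n       # unconditional ML estimate of sigma^2
    s2_0 = sum((x - mu0) ** 2 for x in xs) / n      # ML estimate of sigma^2 when mu = mu0

    def max_log_lik(var):
        # log of (2 * pi * var)^(-n/2) * exp(-n/2)
        return -0.5 * n * math.log(2 * math.pi * var) - 0.5 * n

    return math.exp(max_log_lik(s2_0) - max_log_lik(s2))

def lr_from_t(xs, mu0):
    """l obtained from t^2 = (n - 1) (xbar - mu0)^2 / s^2."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / n
    t2 = (n - 1) * (xbar - mu0) ** 2 / s2
    return (1 + t2 / (n - 1)) ** (-n / 2)

sample = [5.1, 4.7, 6.2, 5.9, 4.4, 5.6]
print(lr_direct(sample, 5.0), lr_from_t(sample, 5.0))
```

The two printed values should coincide up to rounding error, and both equal 1 when the hypothetical mean is set equal to the sample mean.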


Example 24.2 

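Consider again the problem of two means, extensively discussed in Chapters 21
and 23. We have samples of sizes $n_1$, $n_2$ from normal distributions with means and
variances $(\mu_1, \sigma_1^2)$, $(\mu_2, \sigma_2^2)$ and wish to test $H_0: \mu_1 = \mu_2$, which we may re-parametrize
(cf. 23.2) as $H_0: \theta = \mu_1 - \mu_2 = 0$. We call the common unknown value of the means $\mu$.
We have

$$L(x\,|\,\mu_1, \mu_2, \sigma_1^2, \sigma_2^2) = (2\pi)^{-\frac{1}{2}(n_1+n_2)} \sigma_1^{-n_1} \sigma_2^{-n_2} \exp\left\{-\tfrac{1}{2}\left(\sum_{j=1}^{n_1} \frac{(x_{1j}-\mu_1)^2}{\sigma_1^2} + \sum_{j=1}^{n_2} \frac{(x_{2j}-\mu_2)^2}{\sigma_2^2}\right)\right\}.$$

The unconditional ML estimators are
$$\hat\mu_1 = \bar{x}_1, \quad \hat\mu_2 = \bar{x}_2, \quad \hat\sigma_1^2 = s_1^2, \quad \hat\sigma_2^2 = s_2^2,$$
so that
$$L(x\,|\,\hat\mu_1, \hat\mu_2, \hat\sigma_1^2, \hat\sigma_2^2) = (2\pi)^{-\frac{1}{2}(n_1+n_2)} s_1^{-n_1} s_2^{-n_2} \exp\{-\tfrac{1}{2}(n_1+n_2)\}.$$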

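When $H_0$ holds, the ML estimators are roots of the set of three equations
$$\left.\begin{aligned} \frac{n_1(\bar{x}_1-\mu)}{\sigma_1^2} + \frac{n_2(\bar{x}_2-\mu)}{\sigma_2^2} &= 0, \\ \sigma_1^2 = \frac{1}{n_1}\sum_{j=1}^{n_1}(x_{1j}-\mu)^2 &= s_1^2 + (\bar{x}_1-\mu)^2, \\ \sigma_2^2 = \frac{1}{n_2}\sum_{j=1}^{n_2}(x_{2j}-\mu)^2 &= s_2^2 + (\bar{x}_2-\mu)^2. \end{aligned}\right\} \tag{24.10}$$
When the solutions of (24.10) are substituted into the LF, we get
$$L(x\,|\,\hat{\hat\mu}, \hat{\hat\sigma}_1^2, \hat{\hat\sigma}_2^2) = (2\pi)^{-\frac{1}{2}(n_1+n_2)} \hat{\hat\sigma}_1^{-n_1} \hat{\hat\sigma}_2^{-n_2} \exp\{-\tfrac{1}{2}(n_1+n_2)\},$$
and the likelihood ratio is
$$l = \left(\frac{s_1}{\hat{\hat\sigma}_1}\right)^{n_1} \left(\frac{s_2}{\hat{\hat\sigma}_2}\right)^{n_2} = \left\{\frac{s_1^2}{s_1^2 + (\bar{x}_1-\hat{\hat\mu})^2}\right\}^{\frac{1}{2}n_1} \left\{\frac{s_2^2}{s_2^2 + (\bar{x}_2-\hat{\hat\mu})^2}\right\}^{\frac{1}{2}n_2}. \tag{24.11}$$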


We need then only to determine $\hat{\hat\mu}$ to be able to use (24.11). Now by (24.10), we see
that $\hat{\hat\mu}$ is a solution of a cubic equation in $\mu$ whose coefficients are functions of the $n_i$
and of the sums and sums of squares of the two sets of observations. We cannot
therefore write down $\hat{\hat\mu}$ as an explicit function, though we can solve for it numerically
in any given case. Its distribution is, in any case, not independent of the ratio $\sigma_1^2/\sigma_2^2$,
for $\hat{\hat\mu}$ is a function of both $s_1^2$ and $s_2^2$ and $l$ is therefore of the form
$$l = g(s_1^2, s_2^2)\, h(s_1^2, s_2^2).$$
Thus the LR method fails in this case to give us a similar test.
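
Although no similar test results, the numerical solution of (24.10) in a given case is straightforward. The following Python sketch is one possible way of doing it, under our own assumptions about names and method: it eliminates the variances from the first equation of (24.10), locates the real roots of the resulting cubic (all of which lie between the two sample means) by a grid search refined by bisection, and takes the root giving the largest value of $l$ in (24.11). The data are invented for illustration.

```python
import math

def lr_two_means(x1, x2, grid=2000):
    """Return (mu, l): restricted ML estimate of the common mean and the LR statistic."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    s1 = sum((x - m1) ** 2 for x in x1) / n1      # s_1^2 (divisor n_1)
    s2 = sum((x - m2) ** 2 for x in x2) / n2      # s_2^2 (divisor n_2)

    def score(mu):
        # first equation of (24.10) with the variances eliminated by the other two
        return (n1 * (m1 - mu) / (s1 + (m1 - mu) ** 2)
                + n2 * (m2 - mu) / (s2 + (m2 - mu) ** 2))

    def neg2_log_l(mu):
        # -2 log l from (24.11) at a trial value of mu
        return (n1 * math.log(1 + (m1 - mu) ** 2 / s1)
                + n2 * math.log(1 + (m2 - mu) ** 2 / s2))

    lo, hi = min(m1, m2), max(m1, m2)
    if lo == hi:
        return lo, 1.0
    roots = []
    for i in range(grid):
        a = lo + (hi - lo) * i / grid
        b = lo + (hi - lo) * (i + 1) / grid
        fa, fb = score(a), score(b)
        if fa == 0.0:
            roots.append(a)
        elif fa * fb < 0:
            for _ in range(80):                    # bisection on a sign change
                mid = 0.5 * (a + b)
                if score(a) * score(mid) <= 0:
                    b = mid
                else:
                    a = mid
            roots.append(0.5 * (a + b))
    mu = min(roots, key=neg2_log_l)                # root maximizing the likelihood
    return mu, math.exp(-0.5 * neg2_log_l(mu))

x1 = [4.1, 5.3, 6.0, 4.8, 5.5]
x2 = [6.9, 7.4, 5.8, 8.1, 7.0, 6.6]
print(lr_two_means(x1, x2))
```

The grid search is a deliberately simple device; any standard root-finder for the cubic would serve equally well.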


24.3 If, as in Example 24.1, we find that the LR test statistic is a one-to-one
function of some statistic whose distribution is either known exactly (as in that Example)
or can be found, there is no difficulty in constructing a valid test of $H_0$, though we shall
have shortly to consider what desirable properties LR tests as a class possess. However,
it frequently occurs that the LR method is not so convenient, when the test statistic
is a more or less complicated function of the observations whose exact distribution
cannot be obtained, as in Example 24.2. In such a case, we have to resort to
approximations to its distribution.

Since $l$ is distributed on the interval (0, 1), we see that for any fixed constant $c > 0$,
$w = -2c \log l$ will be distributed on the interval $(0, \infty)$. It is therefore natural to seek
an approximation to its distribution by means of a $\chi^2$ variate, which is also on the
interval $(0, \infty)$, adjusting $c$ to make the approximation as close as possible. The
inclination to use such an approximation is increased by the fact, to be proved in 24.7,
that as $n$ increases, the distribution of $-2 \log l$ when $H_0$ holds tends to a $\chi^2$ distribution
with $r$ degrees of freedom. In fact, we shall be able to find the asymptotic
distribution of $-2 \log l$ when $H_1$ holds also, but in order to do this we must introduce a
generalization of the $\chi^2$ distribution.
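
The limiting $\chi^2$ form stated above can be illustrated by simulation for the test of Example 24.1, where $r = 1$ and $-2 \log l = n \log\{1 + (\bar{x}-\mu_0)^2/s^2\}$. The Python sketch below is an assumption-laden illustration (sample size 40, 5000 replications, a fixed random seed, and function names of our own choosing): it draws normal samples for which $H_0$ holds and records the mean of $-2 \log l$ and the proportion of values exceeding 3.841, the upper 5 per cent point of $\chi^2$ with one degree of freedom.

```python
import math
import random

def neg2_log_lr(xs, mu0):
    """-2 log l for H0: mu = mu0 in a normal sample with unknown variance."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / n
    return n * math.log(1 + (xbar - mu0) ** 2 / s2)

def simulate(n=40, reps=5000, seed=1):
    """Mean of -2 log l and its exceedance rate over 3.841 when H0 is true."""
    rng = random.Random(seed)
    vals = [neg2_log_lr([rng.gauss(0.0, 2.0) for _ in range(n)], 0.0)
            for _ in range(reps)]
    mean = sum(vals) / reps
    tail = sum(v > 3.841 for v in vals) / reps
    return mean, tail

print(simulate())
```

For moderate $n$ the mean should be near 1 and the exceedance proportion near 0.05, both slightly above those values because the limit has not quite been reached.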


The non-central x? distribution 

24.4 We have seen in 16.23 that the sum of squares of $n$ independent standardized
normal variates is distributed in the $\chi^2$ form with $n$ degrees of freedom, (16.1), and c.f.
given by (16.3). We now consider the distribution of the statistic

$$z = \sum_{i=1}^{n} x_i^2,$$

where the $x_i$ are still independent normal variates with unit variance, but where their
means differ from zero and
$$E(x_i) = \mu_i, \qquad \sum \mu_i^2 = \lambda. \tag{24.12}$$
We write the joint distribution of the $x_i$ as
$$dF \propto \exp\{-\tfrac{1}{2}(x-\mu)'(x-\mu)\} \prod_i dx_i,$$
and make the orthogonal transformation to a new set of independent normal variates
with variances unity,
$$y = Bx.$$
Since $E(x) = \mu$,
$$\theta = E(y) = B\mu,$$
so that
$$\theta'\theta = \mu'\mu = \lambda, \tag{24.13}$$
since $B'B = I$. We now make the first $(n-1)$ components of $\theta$ equal to zero. Then
by (24.13),
$$\theta_n^2 = \lambda.$$
Thus
$$z = x'x = y'y$$
is a sum of squares of $n$ independent normal variates, the first $(n-1)$ of which are
standardized, and the last of which has mean $\lambda^{1/2}$ and variance 1. We write
$$u = \sum_{i=1}^{n-1} y_i^2, \qquad v = y_n^2,$$
and we know that $u$ is distributed like $\chi^2$ with $(n-1)$ degrees of freedom. The
distribution of $y_n$ is
$$dF \propto \exp\{-\tfrac{1}{2}(y_n - \lambda^{1/2})^2\}\, dy_n,$$
so that the distribution of $v$ is
$$f_1(v)\, dv \propto \frac{dv}{2v^{1/2}} \left[\exp\{-\tfrac{1}{2}(v^{1/2}-\lambda^{1/2})^2\} + \exp\{-\tfrac{1}{2}(-v^{1/2}-\lambda^{1/2})^2\}\right] \propto v^{-1/2} \exp\{-\tfrac{1}{2}(v+\lambda)\} \sum_{r=0}^{\infty} \frac{(v\lambda)^r}{(2r)!}\, dv. \tag{24.14}$$
The joint distribution of $v$ and $u$ is
$$dG \propto f_1(v)\, f_2(u)\, dv\, du, \tag{24.15}$$
where $f_2$ is the $\chi^2$ distribution with $(n-1)$ degrees of freedom
$$f_2(u)\, du \propto e^{-\frac{1}{2}u} u^{\frac{1}{2}(n-3)}\, du. \tag{24.16}$$

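We put (24.14) and (24.16) into (24.15) and make the transformation
$$z = u + v, \qquad w = \frac{u}{u+v},$$
with Jacobian equal to $z$. We find for the joint distribution of $z$ and $w$
$$dG(z, w) \propto e^{-\frac{1}{2}(z+\lambda)} z^{\frac{1}{2}(n-2)} w^{\frac{1}{2}(n-3)} (1-w)^{-\frac{1}{2}} \sum_{r=0}^{\infty} \frac{\lambda^r z^r (1-w)^r}{(2r)!}\, dw\, dz.$$
We now integrate out $w$ over its range from 0 to 1, getting for the marginal distribution of $z$
$$dH(z) \propto e^{-\frac{1}{2}(z+\lambda)} z^{\frac{1}{2}(n-2)} \sum_{r=0}^{\infty} \frac{\lambda^r z^r}{(2r)!} B\{\tfrac{1}{2}(n-1), \tfrac{1}{2}+r\}\, dz. \tag{24.17}$$
To obtain the constant in (24.17), we recall that it does not depend on $\lambda$, and put
$\lambda = 0$. (24.17) should then reduce to (16.1), which is the ordinary $\chi^2$ distribution
with $n$ degrees of freedom. The non-constant factors agree, but whereas (16.1) has
a constant term $1/\{2^{\frac{1}{2}n}\Gamma(\tfrac{1}{2}n)\}$, (24.17) has $B\{\tfrac{1}{2}(n-1), \tfrac{1}{2}\} = \Gamma\{\tfrac{1}{2}(n-1)\}\Gamma(\tfrac{1}{2})/\Gamma(\tfrac{1}{2}n)$. We
therefore divide (24.17) by the factor $2^{\frac{1}{2}n}\Gamma\{\tfrac{1}{2}(n-1)\}\Gamma(\tfrac{1}{2})$ and finally, writing $\nu$ for $n$,
we have for any $\lambda$
$$dH(z) = \frac{e^{-\frac{1}{2}(z+\lambda)} z^{\frac{1}{2}(\nu-2)}}{2^{\frac{1}{2}\nu}\, \Gamma\{\tfrac{1}{2}(\nu-1)\}\, \Gamma(\tfrac{1}{2})} \sum_{r=0}^{\infty} \frac{\lambda^r z^r}{(2r)!} B\{\tfrac{1}{2}(\nu-1), \tfrac{1}{2}+r\}\, dz. \tag{24.18}$$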
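
The series (24.18) is easily evaluated numerically. The Python sketch below, with function names, truncation point and integration grid chosen by us for illustration, sums the series in logarithms for $\nu > 1$ and $z > 0$, and then integrates the result by the midpoint rule to confirm that the total frequency is unity and to obtain the mean and variance of the distribution.

```python
import math

def noncentral_chi2_density(z, nu, lam, terms=60):
    """Frequency function (24.18) for nu > 1 and z > 0, summed in logarithms."""
    a = 0.5 * (nu - 1)
    log_front = (-0.5 * (z + lam) + 0.5 * (nu - 2) * math.log(z)
                 - 0.5 * nu * math.log(2) - math.lgamma(a) - math.lgamma(0.5))
    total = 0.0
    for r in range(terms):
        if r > 0 and lam == 0:
            break                                   # only the r = 0 term survives
        b = 0.5 + r
        log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
        log_power = r * math.log(lam * z) if r > 0 else 0.0
        total += math.exp(log_front + log_power - math.lgamma(2 * r + 1) + log_beta)
    return total

def moments(nu, lam, upper=80.0, steps=4000):
    """Total frequency, mean and variance of (24.18) by the midpoint rule."""
    h = upper / steps
    m0 = m1 = m2 = 0.0
    for i in range(steps):
        z = (i + 0.5) * h
        f = noncentral_chi2_density(z, nu, lam) * h
        m0 += f
        m1 += z * f
        m2 += z * z * f
    return m0, m1, m2 - m1 ** 2

print(moments(4, 3.0))
```

With $\nu = 4$ and $\lambda = 3$ the three printed values should be close to 1, $\nu + \lambda = 7$ and $2(\nu + 2\lambda) = 20$ respectively.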


24.5 The distribution (24.18) is called the non-central $\chi^2$ distribution with $\nu$ degrees
of freedom and non-central parameter $\lambda$, and sometimes written $\chi'^2(\nu, \lambda)$. It was first
given by Fisher (1928a), and has been studied by Wishart (1932), Patnaik (1949) and
by Tiku (1965a). Since its first two cumulants are (cf. Exercise 24.1)
$$\kappa_1 = \nu + \lambda, \qquad \kappa_2 = 2(\nu + 2\lambda), \tag{24.19}$$
it can be approximated by a (central) $\chi^2$ distribution as follows. The first two
cumulants of a $\chi^2$ with $\nu^*$ degrees of freedom are (putting $\lambda = 0$, $\nu = \nu^*$ in (24.19))
$$\kappa_1 = \nu^*, \qquad \kappa_2 = 2\nu^*. \tag{24.20}$$
If we equate the first two cumulants of $\chi'^2$ with those of $\rho\chi^2$, where $\rho$ is a constant
to be determined, we have, from (24.19) and (24.20),
$$\nu + \lambda = \rho\nu^*,$$
$$2(\nu + 2\lambda) = 2\rho^2\nu^*,$$
so that $\chi'^2/\rho$ is approximately a central $\chi^2$ variate with
$$\rho = \frac{\nu + 2\lambda}{\nu + \lambda} = 1 + \frac{\lambda}{\nu + \lambda}, \qquad \nu^* = \frac{(\nu + \lambda)^2}{\nu + 2\lambda} = \nu + \frac{\lambda^2}{\nu + 2\lambda}, \tag{24.21}$$
$\nu^*$ in general being fractional.

Patnaik (1949) shows that this approximation to the d.f. of $\chi'^2$ is adequate for many
purposes, but he also gives better approximations obtained from Edgeworth series
expansions.

If $\nu^*$ is large, we may make the approximation simpler by approximating the $\chi^2$
approximating distribution itself, for (cf. 16.6) $(2\chi'^2/\rho)^{1/2}$ tends to normality with mean
$(2\nu^* - 1)^{1/2}$ and variance 1, while, more slowly, $\chi'^2/\rho$ becomes normal with mean $\nu^*$ and
variance $2\nu^*$.

If $\nu \to \infty$, $\rho \to 1$ and $\nu^* \sim \nu$; but if $\lambda \to \infty$, $\rho \to 2$ and $\nu^* \sim \tfrac{1}{2}\lambda$.
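
The constants of the approximation (24.21) are trivial to compute. The short Python sketch below (the function name is ours, and the numerical values are arbitrary) evaluates $\rho$ and $\nu^*$ and confirms that the first two cumulants of $\rho\chi^2$ with $\nu^*$ degrees of freedom reproduce (24.19).

```python
def patnaik_constants(nu, lam):
    """rho and nu* of (24.21), matching the first two cumulants of (24.19)."""
    rho = (nu + 2 * lam) / (nu + lam)
    nu_star = (nu + lam) ** 2 / (nu + 2 * lam)
    return rho, nu_star

nu, lam = 4, 3.0
rho, nu_star = patnaik_constants(nu, lam)
print(rho, nu_star)
print(rho * nu_star, nu + lam)                      # first cumulants agree
print(2 * rho ** 2 * nu_star, 2 * (nu + 2 * lam))   # second cumulants agree
print(patnaik_constants(4, 1e9))                    # rho near 2, nu* near lam / 2
```

The last line illustrates the limiting behaviour for large $\lambda$.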


24.6 We may now generalize our derivation of 24.4. Suppose that $x$ is a vector
of $n$ multinormal variates with mean $\mu$ and non-singular dispersion matrix $V$. We
can find an orthogonal transformation $x = By$ which reduces the quadratic form
$x'V^{-1}x$ to the diagonal form $y'B'V^{-1}By = y'Cy$, the elements of the diagonal of
$C$ being the latent roots of $V^{-1}$. To $y'Cy$ we apply a further scaling transformation
$y = Dz$, where the leading diagonal elements of the diagonal matrix $D$ are the
reciprocals of the square roots of the corresponding elements of $C$, so that $D^2 = C^{-1}$.
Thus $x'V^{-1}x = y'Cy = z'z$, and $z$ is a vector of $n$ independent normal variates
with unit variances and mean vector $\theta$ satisfying $\mu = BD\theta$. Thus $\lambda = \theta'\theta = \mu'V^{-1}\mu$.
We have now reduced our problem to that considered in 24.4. We see that the
distribution of $x'V^{-1}x$, where $x$ is a multinormal vector with dispersion matrix $V$ and
mean vector $\mu$, is a non-central $\chi^2$ distribution with $n$ degrees of freedom and
non-central parameter $\mu'V^{-1}\mu$. This generalizes the result of 15.10 for multinormal variates
with zero means.
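
This result may be illustrated by simulation in the bivariate case. The Python sketch below is written under our own assumptions (a particular $2 \times 2$ dispersion matrix and mean vector, a fixed random seed, and hypothetical function names): it draws bivariate normal vectors by means of a Cholesky factor of $V$, forms $x'V^{-1}x$, and compares the sample mean and variance with the cumulants $n + \lambda$ and $2(n + 2\lambda)$ of (24.19), where $n = 2$ and $\lambda = \mu'V^{-1}\mu$.

```python
import math
import random

def quadratic_form(x, V):
    """x' V^{-1} x for a 2 x 2 symmetric positive definite V."""
    (a, b), (_, d) = V
    det = a * d - b * b
    return (d * x[0] ** 2 - 2 * b * x[0] * x[1] + a * x[1] ** 2) / det

def draw(mu, V, rng):
    """One bivariate normal vector with mean mu and dispersion matrix V."""
    (a, b), (_, d) = V
    l11 = math.sqrt(a)                 # Cholesky factor L of V, with V = L L'
    l21 = b / l11
    l22 = math.sqrt(d - l21 ** 2)
    e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (mu[0] + l11 * e1, mu[1] + l21 * e1 + l22 * e2)

def simulate(mu, V, reps=20000, seed=7):
    """Sample mean and variance of x' V^{-1} x over repeated draws."""
    rng = random.Random(seed)
    vals = [quadratic_form(draw(mu, V, rng), V) for _ in range(reps)]
    mean = sum(vals) / reps
    var = sum((v - mean) ** 2 for v in vals) / (reps - 1)
    return mean, var

V = ((2.0, 0.6), (0.6, 1.0))
mu = (1.0, -0.5)
lam = quadratic_form(mu, V)            # non-central parameter mu' V^{-1} mu
print(simulate(mu, V), (2 + lam, 2 * (2 + 2 * lam)))
```

The simulated pair should lie close to the theoretical pair printed beside it.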


Graybill and Marsaglia (1957) have generalized the theorems on the distribution of
quadratic forms in normal variates, discussed in 15.10-21 and Exercises 15.13, 15.17, to
the case where $x$ has mean $\mu \ne 0$. Idempotency of a matrix is then a necessary and
sufficient condition that its quadratic form is distributed in a non-central $\chi^2$ distribution,
and all the theorems of Chapter 15 hold with this modification.


The asymptotic distribution of the LR statistic 

24.7. We saw in 18.6 that under regularity conditions the ML estimator (tem- 
porarily written 2) of a single parameter 6 attains the MVB asymptotically. It follows 
from 17.17 that the LF is asymptotically of the form 


dlogL—_—, f@logL\ ,, | 
ai. E( = ye 6), (24.22) 
or 
2 
L o exp te (° a =) e—oy, (24.23) 


showing that the LF reduces to the normal distribution of the ‘‘ asymptotically suffi- 


cient” statistic ¢. 
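For a $k$-component vector of parameters $\theta$, the matrix analogue of (24.22) is

$$\frac{\partial \log L}{\partial \theta} = (t-\theta)'\,V^{-1}, \tag{24.24}$$

where $V^{-1}$ is defined by (cf. 18.26)

$$V^{-1}_{ij} = -E\left(\frac{\partial^2 \log L}{\partial \theta_i\, \partial \theta_j}\right).$$

When integrated, (24.24) gives the analogue of (24.23)

$$L \propto \exp\{-\tfrac{1}{2}(t-\theta)'V^{-1}(t-\theta)\}. \tag{24.25}$$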


We saw in 18.26 that under regularity conditions the vector of ML estimators $t$ is
asymptotically multinormally distributed with the dispersion matrix $V$. Thus the LF
reduces to the multinormal distribution of $t$. This result was rigorously proved by
Wald (1943a).

We may now easily establish the asymptotic distribution of the LR statistic $l$ defined
at (24.4). In virtue of (24.25), we may reduce the problem to considering the ratio
of the maximum of the right-hand side of (24.25) given $H_0$ to its maximum given $H_1$.
When $H_1$ holds, the maximum of (24.25) is when $\theta = \hat\theta = t$, so that every component
of $(t-\theta)$ is equal to zero and we have

$$L(x \mid \hat\theta_r, \hat\theta_s) \propto 1. \tag{24.26}$$

When $H_0$ holds, the $s$ components of $(t-\theta)$ corresponding to $\theta_s$ will still be zero, for
the maximum of (24.25) occurs when $\theta_s = \hat{\hat\theta}_s = t_s$. (24.25) may now be written

$$L(x \mid \theta_{r0}, \hat{\hat\theta}_s) \propto \exp\{-\tfrac{1}{2}(t_r-\theta_{r0})'\,V_r^{-1}(t_r-\theta_{r0})\}, \tag{24.27}$$

the suffix $r$ denoting that we are now confined to an $r$-dimensional distribution. Thus,
from (24.26) and (24.27),

$$l = \frac{L(x \mid \theta_{r0}, \hat{\hat\theta}_s)}{L(x \mid \hat\theta_r, \hat\theta_s)} = \exp\{-\tfrac{1}{2}(t_r-\theta_{r0})'\,V_r^{-1}(t_r-\theta_{r0})\}.$$

Thus

$$-2\log l = (t_r-\theta_{r0})'\,V_r^{-1}(t_r-\theta_{r0}).$$

Now we have seen that $t_r$ is multinormal with dispersion matrix $V_r$ and mean vector $\theta_r$.
Thus, by the result of 24.6, $-2\log l$ for a hypothesis imposing $r$ constraints is
asymptotically distributed in the non-central $\chi^2$ distribution with $r$ degrees of freedom and
non-central parameter

$$\lambda = (\theta_r-\theta_{r0})'\,V_r^{-1}(\theta_r-\theta_{r0}), \tag{24.28}$$

a result due to Wald (1943a). When $H_0$ holds, $\lambda = 0$ and this reduces to a central $\chi^2$
distribution with $r$ degrees of freedom, a result originally due to Wilks (1938a). A
simple rigorous proof of the $H_0$ result is given by K. P. Roy (1957). It should be
emphasized that these results only hold if the conditions for the asymptotic normality
and efficiency of the ML estimators are satisfied.
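As a worked illustration (an assumed special case, not taken from the text), consider $n$ observations from a normal distribution with unknown mean $\mu$ and known variance $\sigma^2$, and the hypothesis $H_0: \mu = \mu_0$. Here $r = 1$, $t_r = \bar{x}$ and $V_r = \sigma^2/n$, and the general result holds exactly rather than asymptotically:

```latex
\begin{align*}
l &= \frac{L(x \mid \mu_0)}{L(x \mid \bar{x})}
   = \exp\left\{-\frac{n(\bar{x}-\mu_0)^2}{2\sigma^2}\right\}, \\
-2\log l &= \frac{n(\bar{x}-\mu_0)^2}{\sigma^2}
   = \left(\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}\right)^{2}, \\
\lambda &= (\mu-\mu_0)\,\frac{n}{\sigma^2}\,(\mu-\mu_0)
   = \frac{n(\mu-\mu_0)^2}{\sigma^2}.
\end{align*}
```

Since $(\bar{x}-\mu_0)\sqrt{n}/\sigma$ is normal with unit variance and mean $\sqrt{\lambda}$, its square is a non-central $\chi^2$ variate with 1 degree of freedom and non-central parameter $\lambda$, in agreement with (24.28).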


The asymptotic power of LR tests 

24.8 The result of 24.7 makes it possible to calculate the asymptotic power
function of the LR test in any case satisfying the conditions for that result to be valid. We
have first to evaluate the matrix $V_r^{-1}$, and then to evaluate the integral

$$P = \int_{\chi'^2_\alpha(\nu,0)}^{\infty} d\chi'^2(\nu,\lambda), \tag{24.29}$$

where $\chi'^2_\alpha(\nu,0)$ is the $100(1-\alpha)$ per cent point of the central $\chi^2$ distribution. $P$ is
the power of the test, and its size when $\lambda = 0$.


Patnaik (1949) gives a table of $P$ for $\alpha = 0.05$, degrees of freedom 2 (1) 12 (2) 20 and
$\lambda = 2\,(2)\,20$. For 1 degree of freedom, we may use the normal d.f. to evaluate $P$ as in
Example 24.3 below. Fix (1949b) gives inverse tables of $\lambda$ for degrees of freedom
1 (1) 20 (2) 40 (5) 60 (10) 100, $\alpha = 0.05$ and $\alpha = 0.01$ and $P$ (her $\beta$) = 0.1 (0.1) 0.9.

If we use the approximation of 24.5 for the non-central distribution, (24.29)
becomes, using (24.21) with $\nu = r$,

$$P \doteq \int_{\left(\frac{r+\lambda}{r+2\lambda}\right)\chi^2_\alpha(r)}^{\infty} d\chi^2\left(r+\frac{\lambda^2}{r+2\lambda}\right), \tag{24.30}$$

where $\chi^2(r)$ is the central $\chi^2$ distribution with $r$ degrees of freedom and $\chi^2_\alpha(r)$ its
$100(1-\alpha)$ per cent point. Putting $\lambda = 0$ in (24.30) gives the size of the test.

The degrees of freedom in (24.30) are usually fractional, and interpolation in the
tables of $\chi^2$ is necessary.

From the fact that the non-central parameter $\lambda$ defined by (24.28) is, under the
regularity conditions assumed, a quadratic form with the elements of $V_r^{-1}$ as coefficients,
it follows, since the variances and covariances are of the order $n^{-1}$, that $\lambda$ will have a
factor $n$ and hence that the power (24.29) tends to 1 as $n$ increases.
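The growth of the power with $n$ is easily illustrated for 1 degree of freedom, where a non-central $\chi^2$ variate is the square of a unit-variance normal variate with mean $\sqrt{\lambda}$. The sketch below is illustrative only: the function names are hypothetical, and the value 0.02 for the non-central parameter per observation is an assumption chosen for the illustration.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power_one_df(lam, z_crit=1.959964):
    """P{(Z + sqrt(lam))^2 > z_crit^2} for a standard normal Z, i.e. the
    power (24.29) with 1 degree of freedom and size 0.05 by default."""
    d = math.sqrt(lam)
    return (1.0 - normal_cdf(z_crit - d)) + normal_cdf(-z_crit - d)

# lambda proportional to n (assumed factor 0.02 per observation)
for n in (10, 50, 200, 1000):
    print(n, round(power_one_df(0.02 * n), 4))
```

With $\lambda = 0$ the function returns the size of the test, and the printed values increase towards 1 as $n$ grows.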


Example 24.3 

To test $H_0: \sigma^2 = \sigma_0^2$ for the normal distribution of Example 24.1. The
unconditional ML estimators are as given there, so that (24.8) remains the unconditional
maximum of the LF. Given our present $H_0$, the ML estimator of $\mu$ is $\hat{\hat\mu} = \bar{x}$ (Example
18.2). Thus

$$L(x \mid \hat{\hat\mu}, \sigma_0^2) = (2\pi\sigma_0^2)^{-n/2} \exp\left\{-\tfrac{1}{2}\,\frac{n s^2}{\sigma_0^2}\right\}. \tag{24.31}$$
The ratio of (24.31) to (24.8) gives

$$l = \left(\frac{s^2}{\sigma_0^2}\right)^{n/2} \exp\left[-\tfrac{1}{2}n\left\{\frac{s^2}{\sigma_0^2}-1\right\}\right],$$

so that

$$z = l^{2/n} = \frac{t}{n}\, e^{1-t/n}, \tag{24.32}$$

where $t = ns^2/\sigma_0^2$. $z$ is a monotone function of $l$, but is not a monotone function of $t/n$,
its derivative being

$$\frac{dz}{d(t/n)} = \left(1-\frac{t}{n}\right) e^{1-t/n},$$

so that $z$ increases steadily for $t < n$ to a maximum at $t = n$ and then decreases steadily.
Putting $l \le c_\alpha$ is therefore equivalent to putting

$$t \le a_\alpha, \qquad t \ge b_\alpha,$$

where $a_\alpha$, $b_\alpha$ are determined, using (24.32), by

$$P\{t \le a_\alpha\} + P\{t \ge b_\alpha\} = \alpha, \qquad a_\alpha e^{-a_\alpha/n} = b_\alpha e^{-b_\alpha/n}. \tag{24.33}$$

Since the statistic $t$ has a $\chi^2$ distribution with $(n-1)$ d.f. when $H_0$ holds, we can use
tables of that distribution to satisfy (24.33).
Now consider the approximate distribution of

$$-2\log l = (t-n) - n\log\left(\frac{t}{n}\right).$$

Since $E(t) = n-1$, $\operatorname{var} t = 2(n-1)$, we may write

$$-2\log l = (t-n) - n\log\left\{1+\frac{t-n}{n}\right\}
= (t-n) - n\left\{\frac{t-n}{n} - \frac{(t-n)^2}{2n^2} + O(n^{-3/2})\right\}
= \frac{(t-n)^2}{2n} + O(n^{-1/2}). \tag{24.34}$$

We have seen (16.6) that, as $n \to \infty$, a $\chi^2$ distribution with $(n-1)$ degrees of freedom
is asymptotically normally distributed with mean $(n-1)$ and variance $2(n-1)$; or
equivalently, that $(t-n)/(2n)^{1/2}$ tends to a standardized normal variate. Its square,
the first term on the right of (24.34), is therefore a $\chi^2$ variate with 1 degree of freedom.
This is precisely the distribution of $-2\log l$ given by the general result of 24.7 when
$H_0$ holds. This result also tells us that when $H_0$ is false, $-2\log l$ has a non-central
$\chi^2$ distribution with 1 degree of freedom and non-central parameter, by (24.28),

$$\lambda = -E\left(\frac{\partial^2 \log L}{\partial(\sigma^2)^2}\right)(\sigma_1^2-\sigma_0^2)^2
= \frac{n}{2\sigma_1^4}\,(\sigma_1^2-\sigma_0^2)^2
= \frac{n}{2}\left(1-\frac{\sigma_0^2}{\sigma_1^2}\right)^2 .$$
Thus the expression (24.30) for the approximate power of the LR test in this case,
where $r = 1$, is

$$P \doteq \int_{\left(\frac{1+\lambda}{1+2\lambda}\right)\chi^2_\alpha(1)}^{\infty} d\chi^2\left(1+\frac{\lambda^2}{1+2\lambda}\right). \tag{24.35}$$

For illustrative purposes we shall evaluate $P$ for one value of $\lambda$ and of $n$ conveniently
chosen. Choose $\chi^2_\alpha(1) = 3.84$ to give a test of size 0.05. Consider the alternative
$\sigma^2 = \sigma_1^2 = 1.25\,\sigma_0^2$. We then have $\lambda = 0.02n$, and we choose $n = 50$ to give $\lambda = 1$.
(24.35) is then

$$P \doteq \int_{2.56}^{\infty} d\chi^2\left(\tfrac{4}{3}\right),$$

and from the Biometrika Tables we find by simple interpolation between 1 and 2 degrees
of freedom that $P = 0.166$ approximately. The exact power may be obtained from
the normal d.f.: it is the power of an equal-tails size-$\alpha$ test against an alternative with
mean $\lambda^{1/2} = 1$ standard deviations distant from the mean on $H_0$, i.e. the proportion of
the alternative distribution lying outside the interval $(-2.96, +0.96)$ standard
deviations from its mean. The normal tables give the value $P = 0.170$. The
approximation to the power function is thus quite accurate enough.
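The two calculations in this example can be reproduced without tables. The sketch below is an illustration under stated assumptions: `chi2_sf` is a hypothetical helper that evaluates the upper tail of a central $\chi^2$ distribution with fractional degrees of freedom from the standard power series for the incomplete Gamma-function, so its value for (24.35) need not coincide exactly with the figure obtained above by linear interpolation in the tables.

```python
import math

def chi2_sf(x, df, terms=200):
    """Upper tail of central chi-square with (possibly fractional) df, x > 0,
    via the series for the regularized lower incomplete gamma function."""
    a, h = df / 2.0, x / 2.0
    term = total = 1.0
    for k in range(1, terms):
        term *= h / (a + k)
        total += term
    lower = math.exp(a * math.log(h) - h - math.lgamma(a + 1.0)) * total
    return 1.0 - lower

def approximate_power(lam, r=1, crit=3.84):
    """Approximate power (24.30): scaled central chi-square with fractional d.f."""
    scale = (r + lam) / (r + 2.0 * lam)
    df = r + lam * lam / (r + 2.0 * lam)
    return chi2_sf(scale * crit, df)

def exact_power_one_df(lam, crit=3.84):
    """Exact power for 1 d.f. from the normal distribution function."""
    phi = lambda v: 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))
    z, d = math.sqrt(crit), math.sqrt(lam)
    return (1.0 - phi(z - d)) + phi(-z - d)

print(approximate_power(1.0), exact_power_one_df(1.0))
```

Both values refer to $\lambda = 1$ and a test of size 0.05, as in the example.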


Closer approximations to the distribution of the LR statistic 


24.9 Confining ourselves now to the distribution of $l$ when $H_0$ holds, we may
seek closer approximations than the asymptotic result of 24.7. As indicated in 24.3,
if we wish to find $\chi^2$ approximations to the distribution of a function of $l$, we can gain
some flexibility by considering the distribution of $w = -2c\log l$ and adjusting $c$ to
improve the approximation.

The simplest way of doing this would be to find the expected value of $w$ and adjust $c$
so that

$$E(w) = r,$$

the expectation of a $\chi^2$ variate with $r$ degrees of freedom. An approximation of this
kind was first given by Bartlett (1937), and a general method for deriving the value
of $c$ has been given by Lawley (1956), who uses essentially the methods of 20.15 to
investigate the moments of $-2\log l$. If

$$E(-2\log l) = r\left\{1 + \frac{a}{n} + O\left(\frac{1}{n^2}\right)\right\}, \tag{24.36}$$

Lawley shows that by putting either

$$w_1 = \frac{-2\log l}{1 + \dfrac{a}{n}} \qquad \text{or} \qquad w_2 = -2\left(1-\frac{a}{n}\right)\log l, \tag{24.37}$$

we not only have

$$E(w) = r + O\left(\frac{1}{n^2}\right),$$

which follows immediately from (24.36) and (24.37), but also that all the cumulants
of $w$ conform, to order $n^{-1}$, with those of a $\chi^2$ distribution with $r$ degrees of freedom.
The simple scaling correction which adjusts the mean of $w$ to the correct value is
therefore an unequivocal improvement.

If even closer approximations are required, they can be obtained in a large class of
situations by methods due to Box (1949), who gives improved $\chi^2$ approximations (cf.
42.11, Vol. 3), shows how to derive a function of $-2\log l$ which is distributed in the
variance-ratio ($F$) distribution, and also derives an asymptotic expansion for its distribution
function in terms of Incomplete Gamma-functions.


Example 24.4 
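$k$ independent samples of sizes $n_i$ ($i = 1, 2, \ldots, k$; $n_i \ge 2$) are taken from different
normal populations with means $\mu_i$ and variances $\sigma_i^2$. To test

$$H_0: \sigma_1^2 = \sigma_2^2 = \ldots = \sigma_k^2,$$

a composite hypothesis imposing the $r = k-1$ constraints

$$\frac{\sigma_2^2}{\sigma_1^2} = \frac{\sigma_3^2}{\sigma_2^2} = \ldots = \frac{\sigma_k^2}{\sigma_{k-1}^2} = 1,$$

and having $s = k+1$ degrees of freedom. Call the common unknown value of the
variances $\sigma^2$.

The unconditional maximum of the LF is obtained, just as in Example 24.1, by
putting

$$\hat\mu_i = \bar{x}_i, \qquad \hat\sigma_i^2 = \frac{1}{n_i}\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_i)^2 = s_i^2,$$

giving

$$L(x \mid \hat\mu_1,\ldots,\hat\mu_k, \hat\sigma_1^2,\ldots,\hat\sigma_k^2) = (2\pi)^{-n/2} \prod_{i=1}^{k}(s_i^2)^{-n_i/2}\, e^{-n/2}, \tag{24.38}$$

where $n = \sum_{i=1}^{k} n_i$.

Given $H_0$, the ML estimators of the means and the common variance $\sigma^2$ are

$$\hat{\hat\mu}_i = \bar{x}_i, \qquad \hat{\hat\sigma}^2 = \frac{1}{n}\sum_{i=1}^{k} n_i s_i^2 = s^2,$$

so that

$$L(x \mid \hat{\hat\mu}_1,\ldots,\hat{\hat\mu}_k, \hat{\hat\sigma}^2) = (2\pi)^{-n/2}(s^2)^{-n/2}\, e^{-n/2}. \tag{24.39}$$

From (24.4), (24.38) and (24.39),

$$l = \prod_{i=1}^{k}\left(\frac{s_i^2}{s^2}\right)^{n_i/2}, \tag{24.40}$$

so that

$$-2\log l = n\log(s^2) - \sum_{i=1}^{k} n_i\log(s_i^2). \tag{24.41}$$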


Now when $H_0$ holds, each of the statistics $n_i s_i^2/(2\sigma^2)$ has a Gamma distribution with
parameter $\tfrac{1}{2}(n_i-1)$, and their sum $n s^2/(2\sigma^2)$ has the same distribution with parameter
$\sum_{i=1}^{k}\tfrac{1}{2}(n_i-1) = \tfrac{1}{2}(n-k)$. For a Gamma variate $x$ with parameter $p$, we have

$$E\{\log(ax)\} = \frac{1}{\Gamma(p)}\int_0^\infty \log(ax)\, e^{-x} x^{p-1}\, dx = \log a + \frac{d}{dp}\log\Gamma(p),$$

which, using Stirling's series (3.63), becomes

$$E\{\log(ax)\} = \log a + \log p - \frac{1}{2p} - \frac{1}{12p^2} + O\left(\frac{1}{p^3}\right). \tag{24.42}$$

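Using (24.42) in (24.41), we have

$$\begin{aligned}
E\{-2\log l\} &= n\left\{\log\left(\frac{2\sigma^2}{n}\right) + \log\left(\frac{n-k}{2}\right) - \frac{1}{n-k} - \frac{1}{3(n-k)^2} + O\left(\frac{1}{n^3}\right)\right\} \\
&\quad - \sum_{i=1}^{k} n_i\left\{\log\left(\frac{2\sigma^2}{n_i}\right) + \log\left(\frac{n_i-1}{2}\right) - \frac{1}{n_i-1} - \frac{1}{3(n_i-1)^2} + O\left(\frac{1}{n_i^3}\right)\right\} \\
&= n\left\{\log\left(1-\frac{k}{n}\right) - \frac{1}{n-k} - \frac{1}{3(n-k)^2} + O\left(\frac{1}{n^3}\right)\right\} \\
&\quad - \sum_{i=1}^{k} n_i\left\{\log\left(1-\frac{1}{n_i}\right) - \frac{1}{n_i-1} - \frac{1}{3(n_i-1)^2} + O\left(\frac{1}{n_i^3}\right)\right\} \\
&= (k-1) + \left[\frac{11}{6}\sum_{i=1}^{k}\frac{1}{n_i} - \frac{1}{n}\left(\tfrac{1}{2}k^2 + k + \tfrac{1}{3}\right)\right] + O\left(\frac{1}{N^2}\right),
\end{aligned} \tag{24.43}$$

where we now write $N$ indifferently for $n_i$ and $n$. We could now improve the $\chi^2$
approximation, in accordance with (24.37), with the expression in square brackets in
(24.43) as $(k-1)\,a/n$.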


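Now consider Bartlett's (1937) modification of the LR statistic (24.40) in which
$n_i$ is replaced throughout by the "degrees of freedom" $n_i - 1 = \nu_i$, so that $n$ is replaced
by $\nu = \sum_{i=1}^{k}(n_i-1) = n-k$. We write this

$$l^* = \prod_{i=1}^{k}\left(\frac{s_i^2}{s^2}\right)^{\nu_i/2},$$

where now

$$s_i^2 = \frac{1}{\nu_i}\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_i)^2, \qquad s^2 = \frac{1}{\nu}\sum_{i=1}^{k}\nu_i s_i^2.$$

Thus

$$-2\log l^* = \nu\log s^2 - \sum_{i=1}^{k}\nu_i\log s_i^2. \tag{24.44}$$

We shall see in Example 24.6 that $l^*$ has the advantage over $l$ that it gives an unbiassed
test for any values of the $n_i$. If we retrace the passage from (24.42) to (24.43), we
find that

$$\begin{aligned}
E(-2\log l^*) &= -\nu\left\{\frac{1}{\nu} + \frac{1}{3\nu^2} + O\left(\frac{1}{\nu^3}\right)\right\} + \sum_{i=1}^{k}\nu_i\left\{\frac{1}{\nu_i} + \frac{1}{3\nu_i^2} + O\left(\frac{1}{\nu_i^3}\right)\right\} \\
&= (k-1) + \frac{1}{3}\left(\sum_{i=1}^{k}\frac{1}{\nu_i} - \frac{1}{\nu}\right) + O\left(\frac{1}{N^2}\right).
\end{aligned} \tag{24.45}$$

From (24.37) and (24.45) it follows that $-2\log l^*$ defined at (24.44) should be divided
by the scaling constant

$$1 + \frac{1}{3(k-1)}\left(\sum_{i=1}^{k}\frac{1}{\nu_i} - \frac{1}{\nu}\right)$$

to give a closer approximation to a $\chi^2$ distribution with $(k-1)$ degrees of freedom.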
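The computation of (24.44) and its scaling constant is easily mechanized. The following is a minimal sketch under the definitions of this example; the function name and the small data sets are hypothetical and serve only as illustration.

```python
import math

def bartlett_statistic(samples):
    """Return (-2 log l*, scaling constant, corrected statistic) for k samples,
    following (24.44) and the scaling constant derived from (24.45)."""
    k = len(samples)
    nus, s2s = [], []
    for xs in samples:
        mean = sum(xs) / len(xs)
        nu_i = len(xs) - 1                       # degrees of freedom of sample i
        nus.append(nu_i)
        s2s.append(sum((x - mean) ** 2 for x in xs) / nu_i)
    nu = sum(nus)
    pooled = sum(v * s for v, s in zip(nus, s2s)) / nu
    raw = nu * math.log(pooled) - sum(v * math.log(s) for v, s in zip(nus, s2s))
    scale = 1.0 + (sum(1.0 / v for v in nus) - 1.0 / nu) / (3.0 * (k - 1))
    return raw, scale, raw / scale

# Illustrative data only
print(bartlett_statistic([[0.0, 1.0, 2.0], [0.0, 2.0, 4.0]]))
```

The corrected statistic is referred to the $\chi^2$ distribution with $k-1$ degrees of freedom, large values being significant.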


LR tests when the range depends upon the parameter 

24.10 The asymptotic distribution of the LR statistic, given in 24.7, depends
essentially on the regularity conditions necessary to establish the asymptotic normality
of ML estimators. We have seen in Example 18.5 that these conditions break down
where the range of the parent distribution is a function of the parameter $\theta$. What
can be said about the distribution of the LR statistic in such cases? It is a remarkable
fact that, as Hogg (1956) showed, for certain hypotheses concerning such distributions
the statistic $-2\log l$ is distributed exactly as $\chi^2$, but with $2r$ degrees of freedom, i.e.
twice as many as there are constraints imposed by the hypothesis.


24.11 We first derive some preliminary results concerning rectangular
distributions. If $k$ variables $z_i$ are independently distributed as

$$dF = dz_i, \qquad 0 \le z_i \le 1, \tag{24.46}$$

the distribution of

$$t_i = -2\log z_i$$

is at once seen to be

$$dG = \tfrac{1}{2}\exp(-\tfrac{1}{2}t_i)\, dt_i, \qquad 0 \le t_i \le \infty,$$

a $\chi^2$ distribution with 2 degrees of freedom, so that the sum of $k$ such independent
variates

$$t = \sum_{i=1}^{k} t_i = -2\sum_{i=1}^{k}\log z_i = -2\log\prod_{i=1}^{k} z_i$$

has a $\chi^2$ distribution with $2k$ degrees of freedom.
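This preliminary result lends itself to a quick simulation check. The sketch below is illustrative: the function names are hypothetical, and it uses the closed form of the upper tail of a $\chi^2$ distribution with an even number $2m$ of degrees of freedom, $e^{-x/2}\sum_{j<m}(x/2)^j/j!$.

```python
import math
import random

def tail_fraction(k, cutoff, draws=20000, seed=2):
    """Fraction of simulated t = -2 * sum(log z_i) exceeding cutoff,
    the z_i being independent and uniform on (0, 1]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        # 1 - random() lies in (0, 1], which avoids log(0)
        t = -2.0 * sum(math.log(1.0 - rng.random()) for _ in range(k))
        hits += t > cutoff
    return hits / draws

def chi2_sf_even(x, df):
    """Upper tail of chi-square with even df = 2m (closed form)."""
    h = x / 2.0
    return math.exp(-h) * sum(h ** j / math.factorial(j) for j in range(df // 2))

# k = 3 uniform variates: compare with chi-square on 2k = 6 degrees of freedom
print(tail_fraction(3, 12.592), chi2_sf_even(12.592, 6))
```

The two printed values should agree to within sampling error.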
It follows from (14.1) that the distribution of $y_{(n_i)}$, the largest among $n_i$ independent
observations from a rectangular distribution on the interval (0, 1), is

$$dH = n_i\, y_{(n_i)}^{n_i-1}\, dy_{(n_i)}, \qquad 0 \le y_{(n_i)} \le 1, \tag{24.47}$$

and hence that $y_{(n_i)}^{n_i} = z_i$ is uniformly distributed as in (24.46). Hence for $k$
independent samples of size $n_i$, $t = -2\log\prod_{i=1}^{k} y_{(n_i)}^{n_i}$ has a $\chi^2$ distribution with $2k$ degrees
of freedom.
Now consider the distribution of the largest of the $k$ largest values $y_{(n_i)}$. Since all
the observations are independent, this is simply the largest of $n = \sum_{i=1}^{k} n_i$ observations
from the original rectangular distribution. If we denote this largest value by $y_{(n)}$, the
distribution of $-2\log y_{(n)}^{n}$ will, by the argument above, be a $\chi^2$ distribution with 2
degrees of freedom. We now show that the statistics $y_{(n)}$ and $u = \prod_{i=1}^{k} y_{(n_i)}^{n_i} / y_{(n)}^{n}$ are
independently distributed. Introduce the parameter $\theta$, so that the original rectangular
distribution is on the interval $(0, \theta)$. The joint frequency function of the $y_{(n_i)}$ then
becomes, from (24.47),

$$f = \prod_{i=1}^{k}\left\{n_i\, y_{(n_i)}^{n_i-1}/\theta^{n_i}\right\} = \theta^{-n}\prod_{i=1}^{k} n_i\, y_{(n_i)}^{n_i-1}.$$

By 17.40, $y_{(n)}$ is sufficient for $\theta$, and by 23.12 its distribution is complete. Thus by
Exercise 23.7 we need only observe that the distribution of $u$ is free of the parameter $\theta$
to establish the result that $u$ is distributed independently of the complete sufficient
statistic $y_{(n)}$.

Since $y_{(n)}$ and $u$ are independent, $y_{(n)}^{n}$ and $u$ are likewise. If we write $\phi_1(t)$ for the
c.f. of $-2\log y_{(n)}^{n}$, $\phi_2(t)$ for the c.f. of $-2\log\prod_{i=1}^{k} y_{(n_i)}^{n_i}$, and $\phi(t)$ for the c.f. of $-2\log u$,
we then have

$$(-2\log u) + (-2\log y_{(n)}^{n}) = -2\log\prod_{i=1}^{k} y_{(n_i)}^{n_i},$$

and using our previous results concerning $\chi^2$ distributions, and the fact that the c.f.
of a sum of independent variates is the product of their c.f.'s (cf. 7.18), we have

$$\phi(t)\,(1-2it)^{-1} = (1-2it)^{-k},$$

whence

$$\phi(t) = (1-2it)^{-(k-1)},$$

so that $-2\log u$ has a $\chi^2$ distribution with $2(k-1)$ degrees of freedom.

Collecting our results finally, we have established that if we have $k$ variates $y_{(n_i)}$
independently distributed as in (24.47), then $-2\log\prod_{i=1}^{k} y_{(n_i)}^{n_i}$ has a $\chi^2$ distribution with
$2k$ degrees of freedom, while, if $y_{(n)}$ is the largest of the $y_{(n_i)}$, $-2\log\left\{\prod_{i=1}^{k} y_{(n_i)}^{n_i}/y_{(n)}^{n}\right\}$
has a $\chi^2$ distribution with $2(k-1)$ degrees of freedom.


24.12 We now consider in turn the two classes of situation in which a single
sufficient statistic exists for $\theta$ when the range is a function of $\theta$, taking first the case
when only one terminal (say the upper) depends on $\theta$. We then have, from 17.40, the
necessary form for the frequency function

$$f(x \mid \theta) = g(x)/h(\theta), \qquad a \le x \le \theta. \tag{24.48}$$

Now suppose we have $k\,(\ge 1)$ separate populations of this form, $f(x_i \mid \theta_i)$, and wish
to test

$$H_0: \theta_1 = \theta_2 = \ldots = \theta_k = \theta_0,$$

a simple hypothesis imposing $k$ constraints, on the basis of samples of sizes $n_i$ ($i = 1, 2,
\ldots, k$). We now find the LR criterion for $H_0$. The unconditional ML estimator of
$\theta_i$ is the largest observation $x_{(n_i)}$ (cf. Example 18.1). Thus

$$L(x \mid \hat\theta_1, \ldots, \hat\theta_k) = \prod_{i=1}^{k}\prod_{j=1}^{n_i} g(x_{ij}) \Big/ \prod_{i=1}^{k}\{h(x_{(n_i)})\}^{n_i}. \tag{24.49}$$

Since $H_0$ is simple, the LF when it holds is determined and no ML estimator is
necessary. We have

$$L(x \mid \theta_0, \ldots, \theta_0) = \prod_{i=1}^{k}\prod_{j=1}^{n_i} g(x_{ij}) \Big/ \{h(\theta_0)\}^{n}.$$

Hence the LR statistic is

$$l = \frac{L(x \mid \theta_0, \ldots, \theta_0)}{L(x \mid \hat\theta_1, \ldots, \hat\theta_k)} = \prod_{i=1}^{k}\left\{\frac{h(x_{(n_i)})}{h(\theta_0)}\right\}^{n_i}. \tag{24.50}$$

When $H_0$ holds, $y_i = h(x_{(n_i)})/h(\theta_0)$ is the probability that an observation falls below
or at $x_{(n_i)}$ and is itself a random variable with distribution obtained from that of $x_{(n_i)}$ as

$$dF = n_i\, y_i^{n_i-1}\, dy_i, \qquad 0 \le y_i \le 1,$$

of the form (24.47). Thus, from the result of the last section,

$$-2\log\prod_{i=1}^{k} y_i^{n_i} = -2\log l$$

has a $\chi^2$ distribution with $2k$ degrees of freedom.


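24.13 We now investigate the composite hypothesis, for $k \ge 2$ populations,

$$H_0: \theta_1 = \theta_2 = \ldots = \theta_k,$$

which imposes $(k-1)$ constraints, leaving the common value of $\theta$ unspecified. The
unconditional maximum of the LF is given by (24.49) as before. The maximum under
our present $H_0$ is $L(x \mid \hat{\hat\theta}, \hat{\hat\theta}, \ldots, \hat{\hat\theta})$, where $\hat{\hat\theta}$ is the ML estimator for the pooled samples,
which is $x_{(n)}$. Thus we have the LR statistic

$$l = \frac{L(x \mid \hat{\hat\theta}, \ldots, \hat{\hat\theta})}{L(x \mid \hat\theta_1, \ldots, \hat\theta_k)} = \prod_{i=1}^{k}\{h(x_{(n_i)})\}^{n_i} \Big/ \{h(x_{(n)})\}^{n}. \tag{24.51}$$

By writing this as

$$l = \prod_{i=1}^{k}\left[\frac{h(x_{(n_i)})}{h(\theta)}\right]^{n_i} \Big/ \left[\frac{h(x_{(n)})}{h(\theta)}\right]^{n},$$

where $\theta$ is the common unspecified value of the $\theta_i$, we see that in the notation of the
last section,

$$l = \prod_{i=1}^{k} y_{(n_i)}^{n_i} \Big/ y_{(n)}^{n},$$

so that by 24.11 we have that in this case $-2\log l$ is distributed like $\chi^2$ with $2(k-1)$
degrees of freedom.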
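For rectangular distributions on $(0, \theta_i)$ we have $g(x) = 1$ and $h(\theta) = \theta$, and the result of 24.13 can be checked by simulation. The sketch below is illustrative, with hypothetical function names and assumed sample sizes. With $k = 2$, $-2\log l$ has 2 degrees of freedom, so that $P\{l \le c\} = c$ when $H_0$ holds, and the size-$\alpha$ critical region is simply $l \le \alpha$.

```python
import math
import random

def lr_statistic(samples):
    """l of (24.51) for rectangular populations on (0, theta_i), where h(theta) = theta:
    product of (largest of sample i / overall largest) ** n_i."""
    top = max(max(xs) for xs in samples)
    log_l = sum(len(xs) * math.log(max(xs) / top) for xs in samples)
    return math.exp(log_l)

def rejection_rate(sizes, thetas, alpha=0.05, draws=20000, seed=3):
    """Simulated P{l <= alpha} for k = 2 samples of the given sizes and terminals."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        samples = [[theta * (1.0 - rng.random()) for _ in range(m)]
                   for m, theta in zip(sizes, thetas)]
        hits += lr_statistic(samples) <= alpha
    return hits / draws

print(rejection_rate((5, 8), (1.0, 1.0)))    # H0 true: close to 0.05
print(rejection_rate((10, 10), (1.0, 3.0)))  # H0 false: close to 1
```

The first rate estimates the size of the test and the second its power against one assumed alternative.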
24.14 When both terminals of the range are functions of $\theta$, we have from 17.41
that if there is a single sufficient statistic for $\theta$, then

$$f(x \mid \theta) = g(x)/h(\theta), \qquad \theta \le x \le b(\theta), \tag{24.52}$$

where $b(\theta)$ must be a monotone decreasing function of $\theta$. For $k \ge 1$ such populations
$f(x_i \mid \theta_i)$, we again test the simple

$$H_0: \theta_1 = \theta_2 = \ldots = \theta_k = \theta_0,$$

on the basis of samples of sizes $n_i$. The unconditional ML estimator of $\theta_i$ is the
sufficient statistic

$$t_i = \min\{x_{(1)i},\; b^{-1}(x_{(n_i)i})\},$$

where $x_{(1)i}$, $x_{(n_i)i}$ are respectively the smallest and largest observations in the $i$th sample.
When $H_0$ holds, the LF is specified by $L(x \mid \theta_0, \ldots, \theta_0)$. Thus the LR statistic

$$l = \frac{L(x \mid \theta_0, \ldots, \theta_0)}{L(x \mid \hat\theta_1, \ldots, \hat\theta_k)} = \prod_{i=1}^{k}\left\{\frac{h(t_i)}{h(\theta_0)}\right\}^{n_i}. \tag{24.53}$$

Just as for (24.50), we see that

$$l = \prod_{i=1}^{k} y_i^{n_i},$$

where the $y_i$ are distributed in the form (24.47), and hence $-2\log l$ again has a $\chi^2$
distribution with $2k$ degrees of freedom.

Similarly for the composite hypothesis with $(k-1)$ constraints ($k \ge 2$)

$$H_0: \theta_1 = \theta_2 = \ldots = \theta_k,$$

we find, just as in 24.13, that the LR statistic is

$$l = \frac{L(x \mid \hat{\hat\theta}, \ldots, \hat{\hat\theta})}{L(x \mid \hat\theta_1, \ldots, \hat\theta_k)} = \prod_{i=1}^{k}[h(t_i)]^{n_i} \Big/ [h(t)]^{n},$$

where $t = \min\{t_i\}$ is the combined ML estimator $\hat{\hat\theta}$, so that by writing

$$l = \prod_{i=1}^{k}\left[\frac{h(t_i)}{h(\theta)}\right]^{n_i} \Big/ \left[\frac{h(t)}{h(\theta)}\right]^{n}$$

we again reduce $l$ to the form required in 24.11 for $-2\log l$ to be distributed like $\chi^2$
with $2(k-1)$ degrees of freedom.


24.15 We have thus obtained exact $\chi^2$ distributions for two classes of hypotheses
concerning distributions whose terminals depend upon the parameter being tested.
Exercises 24.8 and 24.9 give further examples, one exact and one asymptotic, of LR
tests for which $-2\log l$ has a $\chi^2$ distribution with twice as many degrees of freedom
as there are constraints imposed by the hypothesis tested. It will have been noted
that these $\chi^2$ forms spring not from any tendency to multinormality on the part of
the ML estimators, as did the limiting results of 24.7 for "regular" situations, but
from the intimate connexion between the rectangular and $\chi^2$ distributions explored in
24.11. One effect of this difference is that the power functions of the tests take a
quite different form from that obtained by use of the non-central $\chi^2$ distribution in
24.8.
Barr (1966) finds the power function of the test of the simple hypothesis based
on the LR statistic (24.50), and shows that the test is unbiassed. When $k = 1$, the
test is shown to be UMP (cf. Example 22.6 and the condition (22.37)), but there is no
UMP test for $k > 1$, and the LR test is not even UMPU. For the composite hypothesis
of 24.13 with $k = 2$, the power function of the LR test statistic (24.51) is used to show
that the LR test is UMPU.


The properties of LR tests 

24.16 So far, we have been concerned entirely with the problems of determining 
the distribution of the LR statistic, or a function of it. We now have to inquire into 
the properties of LR tests, in particular the question of their unbiassedness and whether 
they are optimum tests in any sense. First, however, we turn to consider a weaker 
property, that of consistency, which we now define for the first time. 


Test consistency 

24.17 A test of a hypothesis $H_0$ against a class of alternatives $H_1$ is said to be
consistent if, when any member of $H_1$ holds, the probability of rejecting $H_0$ tends to 1 as
sample size(s) tend to infinity. If $w$ is the critical region, and $x$ the sample point, we
write this

$$\lim_{n \to \infty} P\{x \in w \mid H_1\} = 1. \tag{24.54}$$

The idea of test consistency, which is a simple and natural one, was first introduced
by Wald and Wolfowitz (1940). It seems perfectly reasonable to require that, as the
number of observations increases, any test worth considering should reject a false
hypothesis with increasing certainty, and in the limit with complete certainty. Test
consistency is as intrinsically acceptable as is consistency in estimation (17.7), of which
it is in one sense a generalization. For if a test concerning the value of $\theta$ is based on
a statistic which is a consistent estimator of $\theta$, it is immediately obvious that the test
will be consistent too. But an inconsistent estimator may still provide a consistent
test. For example, if $t$ tends in probability to $a\theta$, $t$ will give a consistent test of
hypotheses about $\theta$. In general, it is clear that it is sufficient for test consistency that the
test statistic, when regarded as an estimator, should tend in probability to some
one-to-one function of $\theta$.

Since the condition that a size-$\alpha$ test be unbiassed is (cf. (23.53)) that

$$P\{x \in w \mid H_1\} \ge \alpha, \tag{24.55}$$

it is clear from (24.54) and (24.55) that a consistent test will lose its bias, if any, as
$n \to \infty$. However, an unbiassed test need not be consistent.(*)

(*) Cf. the remark in 17.9 on consistent and asymptotically unbiassed estimators.


The consistency and unbiassedness of LR tests


24.18 We saw in 18.10 and 18.22 that under a very generally satisfied condition,
the ML estimator $\hat\theta$ of a parameter-vector $\theta$ is consistent, though in other circumstances
it need not be. If we take it that we are dealing with a situation in which all the ML
estimators are consistent, we see from the definition of the LR statistic at (24.4) that,
as sample sizes increase,

$$l \to \frac{L(x \mid \theta_{r0}, \theta_s)}{L(x \mid \theta_r, \theta_s)}, \tag{24.56}$$

where $\theta_r$, $\theta_s$ are the true values of those parameters, and $\theta_{r0}$ is the hypothetical value
of $\theta_r$ being tested. Thus, when $H_0$ holds, $l \to 1$ in probability, and the critical region
(24.6) will therefore have its boundary $c_\alpha$ approaching 1. When $H_0$ does not hold,
the limiting value of $l$ in (24.56) will be some constant $k$ satisfying (cf. (18.20))

$$0 \le k < 1,$$

and thus we have

$$P\{l \le c_\alpha\} \to 1 \tag{24.57}$$

and the LR test is consistent.


In 24.8 we confirmed from the approximate power function that LR tests are con- 
sistent under regularity conditions, and in 24.15 we deduced consistency in another case, 
not covered by 24.8. Both of these examples are special cases of our present dis- 
cussion. 


24.19 When we turn to the question of unbiassedness, we recall the penultimate
sentence of 24.17 which, coupled with the result of 24.18, ensures that most LR
tests are asymptotically unbiassed. Of itself, this is not very comforting (though
it would be so if it could be shown under reasonable restrictions that the maximum
extent of the bias is always small), for the criterion of unbiassedness in tests is intuitively
attractive enough to impose itself as a necessity for all sample sizes. Example 24.5 shows
that an important LR test is biassed.


Example 24.5 

Consider again the hypothesis $H_0$ of Example 24.3. The LR test uses as its critical
region the tails of the $\chi^2_{n-1}$ distribution of $t = ns^2/\sigma_0^2$ determined by (24.33). Now
in Examples 23.12 and 23.14 we saw that the unbiassed (actually UMPU) test of $H_0$
was determined from the distribution of $t$ by the relations

$$P\{t \le a_\alpha\} + P\{t \ge b_\alpha\} = \alpha, \qquad a_\alpha^{(n-1)/2}\exp(-a_\alpha/2) = b_\alpha^{(n-1)/2}\exp(-b_\alpha/2). \tag{24.58}$$

It is clear on comparison of (24.58) with (24.33) that they would only give the same
result if $a_\alpha = b_\alpha$, which cannot hold except in the trivial case $a_\alpha - b_\alpha = 0$, $\alpha = 1$. In
all other cases, the tests have different critical regions, the LR test having higher values
of $a_\alpha$ and $b_\alpha$ than the unbiassed test, i.e. a larger fraction of $\alpha$ concentrated in the lower
tail. It is easy to see that for alternative values of $\sigma^2$ just larger than $\sigma_0^2$, for which
the distribution of $t$ is slightly displaced towards higher values of $t$ compared to its
$H_0$ distribution, the probability content lost to the critical region of the LR test through
its larger value of $b_\alpha$ will exceed the gain due to the larger value of $a_\alpha$; and thus the
LR test will be biassed.

It will be seen by reference to Example 23.12 that whereas the LR test has values
of $a_\alpha$, $b_\alpha$ too large for unbiassedness, the "equal-tails" test there discussed has $a_\alpha$, $b_\alpha$
too small for unbiassedness. Thus the two more or less intuitively acceptable critical
regions "bracket" the unbiassed critical region.

If in (24.33) we replace $n$ by $n-1$, it becomes precisely equivalent to the unbiassed
(24.58), confirming the general fact that the LR test loses its bias asymptotically. It is
suggestive to trace this bias to its source. If, in constructing the LR statistic in Example
24.3, we had adjusted the unconditional ML estimator of $\sigma^2$ to be unbiassed, $s^2$ would
have been replaced by $\left(\dfrac{n}{n-1}\right)s^2$, and the adjusted LR test would have been
unbiassed: the estimation bias of the ML estimator lies behind the test bias of the LR
test.
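The bias can be exhibited numerically. The sketch below is an illustration under stated assumptions: it takes $n = 5$, so that $t$ has 4 degrees of freedom and its distribution function has a closed form, and solves the pair of relations (24.33) and the unbiassed pair, written in the equivalent form obtained by replacing $n$ by $n-1$ in (24.33), by bisection. All function names are hypothetical.

```python
import math

def chi2_cdf_even(x, df):
    """Distribution function of chi-square with even df = 2m (closed form)."""
    h = x / 2.0
    return 1.0 - math.exp(-h) * sum(h ** j / math.factorial(j) for j in range(df // 2))

def bisect(func, lo, hi, iters=100):
    """Root of func on [lo, hi], assuming a sign change between lo and hi."""
    f_lo = func(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def critical_values(df, c, alpha=0.05):
    """Solve P{t <= a} + P{t >= b} = alpha with a*exp(-a/c) = b*exp(-b/c).
    c = n gives the LR test; c = n - 1 = df gives the unbiassed test."""
    g = lambda t: math.log(t) - t / c      # increasing below c, decreasing above
    def upper(a):
        return bisect(lambda b: g(b) - g(a), c, 200.0 * c)
    def excess(a):
        return chi2_cdf_even(a, df) + 1.0 - chi2_cdf_even(upper(a), df) - alpha
    a = bisect(excess, 1e-9, 0.999 * c)
    return a, upper(a)

def power(a, b, df, ratio):
    """P{t <= a} + P{t >= b} when sigma^2 / sigma_0^2 = ratio."""
    return chi2_cdf_even(a / ratio, df) + 1.0 - chi2_cdf_even(b / ratio, df)

n = 5
a_lr, b_lr = critical_values(n - 1, float(n))
a_un, b_un = critical_values(n - 1, float(n - 1))
print(a_lr, b_lr, power(a_lr, b_lr, n - 1, 1.1))
print(a_un, b_un, power(a_un, b_un, n - 1, 1.1))
```

For a variance ratio slightly above 1 the power of the LR test falls below its size, while that of the unbiassed test does not.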


Unbiassed tests for location and scale parameters 

24.20 Example 24.5 suggests that a good principle in constructing a LR test is 
to adjust all the ML estimators used in the process so that they are unbiassed. A 
further confirmation of this principle is contained in Example 24.4, where we stated 
that the adjusted LR statistic /* gives an unbiassed test. We now prove this, develop- 
ing a method due to Pitman (1939b) for this purpose. 

If the hypothesis being tested concerns a set of $k$ location parameters $\theta_i$
($i = 1, 2, \ldots, k$), we write the joint distribution of the variates as

$$dF = f(x_1-\theta_1, x_2-\theta_2, \ldots, x_k-\theta_k)\, dx_1 \ldots dx_k. \tag{24.59}$$

We wish to test

$$H_0: \theta_1 = \theta_2 = \ldots = \theta_k. \tag{24.60}$$

Any test statistic $t$, to be satisfactory intuitively, must satisfy the invariance condition

$$t(x_1, x_2, \ldots, x_k) = t(x_1-\lambda, x_2-\lambda, \ldots, x_k-\lambda). \tag{24.61}$$

We may therefore without loss of generality take the common value of the $\theta_i$ in (24.60)
to be zero. We suppose that $t \ge 0$, and that $w_0$, the size-$\alpha$ critical region based on
the distribution of $t$, is defined by

$$t \le c_\alpha; \tag{24.62}$$

if either of these statements were not true, we could transform to a function of $t$ for
which they were.

Because of its invariance property (24.61), $t$ must be constant in the $k$-dimensional
sample space $W$ on any line $L$ parallel to the equiangular vector $V$ defined by
$x_1 = x_2 = \ldots = x_k$, and thus $w_0$ will lie outside a hypercylinder with axis parallel
to $V$. When $H_0$ holds, the content of $w_0$ is its size

$$\alpha = \int_{w_0} dF(x_1, x_2, \ldots, x_k), \tag{24.63}$$

and when $H_0$ is not true the content of $w_0$ is its power

$$1-\beta = \int_{w_0} dF(x_1-\theta_1, x_2-\theta_2, \ldots, x_k-\theta_k) = \int_{w_1} dF(x_1, x_2, \ldots, x_k), \tag{24.64}$$
where $w_1$ is derived from $w_0$ by translation in $W$ without rotation. We define the
integral, over any line $L$ parallel to $V$,

$$P(L) = \int_L f(x_1, x_2, \ldots, x_k)\, d\bar{x}. \tag{24.65}$$

Since variation in $\bar{x}$ is along any line $L$, and the aggregate of lines $L$ is the whole of $W$,
we have

$$\int_W P(L)\, dL = \int_W \left\{\int_L f\, d\bar{x}\right\} dL = \int_W f\, dx_1 \ldots dx_k = 1. \tag{24.66}$$

We now determine $w_0$ as the aggregate of all lines $L$ for which the statistic $P(L) \le$ some
constant $h$. Then $P(L)$ will exceed $h$ on any $L$ which is in $w_1$ but not in $w_0$. Hence,
from (24.63), (24.64) and (24.66),

$$\int_{w_0} dF \le \int_{w_1} dF,$$

so that the test is unbiassed. We therefore need only define the test statistic $t$ so that
at any point on a line $L$, parallel to $V$, it is equal to $P(L)$. Now using the invariance
property (24.61) with $\lambda = \bar{x}$ we have, conditionally upon $L$,

$$t(x) \mid L = \int_L f(x_1-\bar{x}, x_2-\bar{x}, \ldots, x_k-\bar{x})\, d\bar{x},$$

so that unconditionally we have for the test statistic

$$t(x) = \int \left\{\int_L f(x_1-\bar{x}, x_2-\bar{x}, \ldots, x_k-\bar{x})\, d\bar{x}\right\} dL,$$

and replacing $\bar{x}$ by $u$, this is, on integration with respect to $L$,

$$t(x) = \int_{-\infty}^{\infty} f(x_1-u, x_2-u, \ldots, x_k-u)\, du, \tag{24.67}$$

the unbiassed size-$\alpha$ region being defined by (24.62). It will be seen that the unbiassed
test thus obtained is unique. An example of the use of (24.67) is given in Exercise 24.15.


24.21 Turning now to tests concerning scale parameters, which are more to our
present purpose, suppose that the joint distribution of $k$ variates is

$$dG = g\left(\frac{y_1}{\phi_1}, \frac{y_2}{\phi_2}, \ldots, \frac{y_k}{\phi_k}\right)\frac{dy_1}{\phi_1}\ldots\frac{dy_k}{\phi_k}, \tag{24.68}$$

where all the scale parameters $\phi_i$ are positive. We make the transformation

$$x_i = \log|y_i|, \qquad \theta_i = \log\phi_i,$$

and find for the distribution of the $x_i$

$$dF = g\{\exp(x_1-\theta_1), \exp(x_2-\theta_2), \ldots, \exp(x_k-\theta_k)\}\exp\left\{\sum_{i=1}^{k}(x_i-\theta_i)\right\} dx_1 \ldots dx_k. \tag{24.69}$$

(24.69) is of the form (24.59) which we have already discussed. To test

$$H_0': \phi_1 = \phi_2 = \ldots = \phi_k \tag{24.70}$$


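is the same as to test $H_0$ of (24.60). The statistic (24.67) becomes

$$t(x) = \int_{-\infty}^{\infty} g\{\exp(x_1-u), \exp(x_2-u), \ldots, \exp(x_k-u)\}\exp\left\{\sum_{i=1}^{k} x_i - ku\right\} du,$$

which when expressed in terms of the $y_i$ becomes

$$t(y) = \int_0^{\infty}\prod_{i=1}^{k}|y_i|\; g\left(\frac{y_1}{v}, \frac{y_2}{v}, \ldots, \frac{y_k}{v}\right)\frac{dv}{v^{k+1}}. \tag{24.71}$$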


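24.22 Now consider the special case of $k$ independently distributed Gamma
variates $y_i/\phi_i$ with parameters $m_i$. Their joint distribution is

$$dG = \prod_{i=1}^{k}\left\{\frac{1}{\Gamma(m_i)}\left(\frac{y_i}{\phi_i}\right)^{m_i-1}\right\}\exp\left(-\sum_{i=1}^{k}\frac{y_i}{\phi_i}\right)\prod_{i=1}^{k}\frac{dy_i}{\phi_i}. \tag{24.72}$$

To test $H_0'$ of (24.70), we use (24.71) and obtain

$$t(y) = \prod_{i=1}^{k}\left\{\frac{y_i^{m_i}}{\Gamma(m_i)}\right\}\int_0^{\infty}\exp\left(-\sum_{i=1}^{k}\frac{y_i}{v}\right)\frac{dv}{v^{m+1}}, \tag{24.73}$$

where $m = \sum_{i=1}^{k} m_i$. On substituting $u = \sum_{i=1}^{k} y_i/v$ in (24.73), we find

$$t(y) = \left[\frac{\Gamma(m)}{\prod_{i}\Gamma(m_i)}\right]\frac{\prod_{i} y_i^{m_i}}{\left(\sum_{i} y_i\right)^{m}}. \tag{24.74}$$

We now neglect the constant factor in square brackets in (24.74). From the remainder,
$T$, the maximum attainable value of $t$, occurs when $y_i/\sum_i y_i = m_i/m$, when

$$T = \prod_{i} m_i^{m_i}/m^{m}. \tag{24.75}$$

We now write

$$t^* = -\log\left(\frac{t}{T}\right) = m\log\left(\frac{\sum_i y_i}{m}\right) - \sum_{i} m_i\log\left(\frac{y_i}{m_i}\right). \tag{24.76}$$

$t^*$ will be unbiassed for $H_0'$, and will range from 0 to $\infty$, large values being rejected.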
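The statistic (24.76) is simple to compute. The following minimal sketch (hypothetical function name, illustrative values) evaluates $t^*$ and shows that it vanishes when the $y_i$ are proportional to the $m_i$, which is the configuration giving the maximum $T$ of (24.75).

```python
import math

def t_star(y, m):
    """t* of (24.76): m log(sum(y)/m) - sum(m_i log(y_i/m_i)), with y_i, m_i > 0."""
    m_total = sum(m)
    return (m_total * math.log(sum(y) / m_total)
            - sum(mi * math.log(yi / mi) for yi, mi in zip(y, m)))

print(t_star([1.0, 2.0, 3.0], [0.5, 1.0, 1.5]))  # y_i proportional to m_i
print(t_star([1.0, 1.0], [1.0, 3.0]))            # positive otherwise
```

The first value is zero up to rounding error, and the second is positive.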


Example 24.6 
We may now apply (24.76) to the problem of testing the equality of $k$ normal
variances, discussed in Example 24.4. For each of the quantities $\sum_j (x_{ij}-\bar{x}_i)^2/(2\sigma^2)$ is,
when $H_0$ holds, a Gamma variate with parameter $\tfrac{1}{2}(n_i-1)$. We thus have to substitute
in (24.76)

$$y_i = \sum_j (x_{ij}-\bar{x}_i)^2 = \nu_i s_i^2, \qquad m_i = \tfrac{1}{2}(n_i-1) = \tfrac{1}{2}\nu_i, \qquad m = \sum_i m_i = \tfrac{1}{2}(n-k) = \tfrac{1}{2}\nu, \tag{24.77}$$

and we find for the unbiassed test statistic

$$2t^* = \nu\log\left(\frac{\sum_i \nu_i s_i^2}{\nu}\right) - \sum_i \nu_i\log(s_i^2). \tag{24.78}$$

(24.78) is identical with (24.44), so that $2t^*$ is simply $-2\log l^*$ which we discussed
there. Thus the $l^*$ test is unbiassed, as stated in Example 24.4. From this, it is fairly
evident that the unadjusted LR test statistic $l$ of (24.40), which employs another
weighting system, cannot also be unbiassed in general. When all sample sizes are
equal, the two tests are equivalent, as Exercise 24.7 shows. Even in the case $k = 2$,
the unadjusted LR test is biassed when $n_1 \neq n_2$: this is left to the reader to prove
in Exercise 24.14.


24.23 Before leaving the question of the unbiassedness of LR tests, it should be 
mentioned that Paulson (1941) investigated the bias of a number of LR tests for expon- 
ential distributions—some of his results are given in Exercises 24.16 and 24.18—and that 
Daly (1940) and Narain (1950) have shown a number of LR tests of independence in 
multinormal systems to be unbiassed: we shall refer to their results as we encounter 
these tests in Volume 3. 


Other properties of LR tests 


24.24 Apart from questions of consistency and unbiassedness, what can be said
in general concerning the properties of LR tests? In the first place, we know that
ML estimators are functions of the sufficient statistics (cf. 18.4) so that the LR statistic
(24.4) may be re-written

$$l = \frac{L(x \mid \theta_{r0}, t_s)}{L(x \mid T_{r+s})}, \tag{24.79}$$

where $t_s$ is the vector minimal sufficient for $\theta_s$ when $H_0$ holds and $T_{r+s}$ is the statistic
sufficient for all the parameters when $H_0$ does not hold. As we have seen in 17.38, it
is not true in general that the components of $T_{r+s}$ include the components of $t_s$: the
sufficient statistic for $\theta_s$ when $H_0$ holds may no longer form part of the sufficient set
when $H_0$ does not hold, and even when it does may not then be separately sufficient
for $\theta_s$, merely forming part of $T_{r+s}$ which is sufficient for $(\theta_r, \theta_s)$. Thus all that we
can say of $l$ is that it is some function of the two sets of sufficient statistics involved.
There is, in general, no reason to suppose that it will be the right function of them.

It is easily seen that the LR method does not necessarily produce a UMP test
when one exists, by observing that even in the case of testing a simple $H_0$ against a
simple $H_1$, it does not yield the BCR (22.6).

If we are seeking a UMPU test, the LR method is handicapped by its own general
biassedness, but we have seen that a simple bias adjustment will sometimes remove
this difficulty. The adjustment takes the form of a "reweighting" of the test statistic
by substituting unbiassed estimators for the ordinary ML estimators (Examples 24.4,
24.5 and 24.6), or sometimes equivalently of an adjustment of the critical
region of the statistic to which the LR method leads (Exercise 24.14). Exercise 24.16
shows that two UMP tests derived in Exercises 23.25-26 for an exponential and a
rectangular distribution are produced by the LR method, while the UMPU test for
an exponential distribution given in Exercise 23.24 is not equivalent to the LR test,
which is biassed.

Wald (1943a) shows that the LR test asymptotically has a number of optimum
power properties under regularity conditions (but see 25.4 and Example 25.1).
Hoeffding (1965) derives an optimum property of LR tests for multinomial distributions when
test size $\alpha \to 0$ as sample size $\to \infty$.

The LR principle is an intuitively appealing one when there is no "optimum" test.
It is of particular value in tests of linear hypotheses (which we shall discuss in the 
second part of this chapter) for which, in general, no UMPU test exists. But it is as 
well to be reminded of the possible fallibility of the LR method in exceptional cir- 
cumstances, and the following example, adapted from one due to C. Stein and given 
by Lehmann (1950), is a salutary warning against using the method without investigation 
of its properties in the particular situation concerned. 


Example 24.7 
A discrete random variable $x$ is defined at the values $0, \pm 1, \pm 2$, and the
probabilities at these points given a hypothesis $H_1$ are:
$$ \begin{array}{l|cccc}
x & 0 & \pm 1 & +2 & -2 \\ \hline
P \mid H_1 & \alpha\left(\dfrac{1-\theta_1}{1-\alpha}\right) & (\tfrac12-\alpha)\left(\dfrac{1-\theta_1}{1-\alpha}\right) & \theta_1\theta_2 & \theta_1(1-\theta_2)
\end{array} \tag{24.80} $$
The parameters $\theta_1, \theta_2$ are restricted by the inequalities
$$ 0 \le \theta_1 \le \alpha < \tfrac12, \qquad 0 \le \theta_2 \le 1, $$
where $\alpha$ is a known constant. We wish to test the simple
$$ H_0: \theta_1 = \alpha, \quad \theta_2 = \tfrac12, $$
$H_1$ being the general alternative (24.80), on the evidence of a single observation. The
probabilities on $H_0$ are:
$$ \begin{array}{l|cccc}
x & 0 & \pm 1 & +2 & -2 \\ \hline
P \mid H_0 & \alpha & \tfrac12-\alpha & \tfrac12\alpha & \tfrac12\alpha
\end{array} \tag{24.81} $$
The LF is independent of $\theta_2$ when $x = 0, \pm 1$, and is maximized unconditionally by
making $\theta_1$ as small as possible, i.e. putting $\hat\theta_1 = 0$. The LR statistic is therefore
$$ l = \frac{L(x \mid H_0)}{L(x \mid \hat\theta_1, \hat\theta_2)} = 1-\alpha, \qquad x = 0, \pm 1. \tag{24.82} $$
When $x = +2$ or $-2$, the LF is maximized unconditionally by choosing $\theta_2$ respectively
as large or as small as possible, i.e. $\hat\theta_2 = 1, 0$ respectively; and by choosing $\theta_1$ as
large as possible, i.e. $\hat\theta_1 = \alpha$. The maximum value of the LF is therefore $\alpha$ and the
LR statistic is
$$ l = \frac{\tfrac12\alpha}{\alpha} = \tfrac12, \qquad x = \pm 2. \tag{24.83} $$
Since $\alpha < \tfrac12$, it follows from (24.82) and (24.83) that the LR test consists of rejecting
$H_0$ when $x = \pm 2$. From (24.81) this test is seen to be of size $\alpha$. But from (24.80)
its power is seen to be $\theta_1$ exactly, so for any value of $\theta_1$ in
$$ 0 \le \theta_1 < \alpha \tag{24.84} $$
the LR test will be biassed for all $\theta_2$, while for $\theta_1 = \alpha$ the test will have power equal
to its size $\alpha$ for all $\theta_2$. In this latter extreme case the test is useless, but in the former
case it is worse than useless, for we can get a test of size and power $\alpha$ by using a table
of random numbers as the basis for our decision concerning $H_0$. Furthermore, a
useful test exists, for if we reject $H_0$ when $x = 0$, we still have a size-$\alpha$ test by (24.81)
and its power, from (24.80), is $\alpha\left(\dfrac{1-\theta_1}{1-\alpha}\right)$, which exceeds $\alpha$ when (24.84) holds and
equals $\alpha$ when $\theta_1 = \alpha$.

Apart from the fact that the random variable is discrete, the noteworthy feature of
this cautionary example is that the range of one of the parameters is determined by $\alpha$,
the size of the test.
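
The arithmetic of Example 24.7 is easily checked by machine. The following sketch is an illustration added here, not part of the original example: it evaluates the probabilities (24.80) exactly with rational arithmetic and compares the LR region $x = \pm 2$ with the alternative region $x = 0$. The function names and the particular values of $\alpha$, $\theta_1$, $\theta_2$ are hypothetical choices.

```python
from fractions import Fraction

def h1_probs(alpha, theta1, theta2):
    """Probabilities (24.80) of x = 0, +1, -1, +2, -2 given H1."""
    scale = (1 - theta1) / (1 - alpha)
    half = Fraction(1, 2)
    return {
        0: alpha * scale,
        1: (half - alpha) * scale,
        -1: (half - alpha) * scale,
        2: theta1 * theta2,
        -2: theta1 * (1 - theta2),
    }

def rejection_probability(region, alpha, theta1, theta2):
    """Probability that x falls in the critical region."""
    probs = h1_probs(alpha, theta1, theta2)
    return sum(probs[x] for x in region)

alpha = Fraction(1, 5)           # known constant, alpha < 1/2
lr_region = {2, -2}              # critical region of the LR test
alt_region = {0}                 # critical region of the competing test

# Sizes: evaluate at H0 (theta1 = alpha, theta2 = 1/2).
size_lr = rejection_probability(lr_region, alpha, alpha, Fraction(1, 2))
size_alt = rejection_probability(alt_region, alpha, alpha, Fraction(1, 2))

# Powers at one alternative satisfying (24.84).
theta1, theta2 = Fraction(1, 10), Fraction(3, 10)
power_lr = rejection_probability(lr_region, alpha, theta1, theta2)
power_alt = rejection_probability(alt_region, alpha, theta1, theta2)
```

With these values both tests have size $\tfrac15$, but the LR test has power $\theta_1 = \tfrac1{10}$, below its size, while the test rejecting at $x = 0$ has power $\tfrac9{40}$.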

D. R. Cox (1961, 1962) considers the distribution of LR statistics when $H_0$ and $H_1$
are entirely separate families of composite hypotheses (so that (24.5) no longer holds)
and obtains some large-sample results.


The general linear hypothesis and its canonical form 
24.25 We are now in a position to discuss the problem of testing hypotheses in
the general linear model of Chapter 19. As at (19.8), we write
$$ y = X\theta + \epsilon, \tag{24.85} $$
where the $\epsilon_i$ have zero means, equal variances $\sigma^2$ and are uncorrelated. For the
moment, we make no further assumptions about the form of their distribution. We
take $X'X$ to be non-singular: if it were not, we could make it so by augmentation, as
in 19.13-16.
Suppose that we wish to test the hypothesis
$$ H_0: A\theta = c_0, \tag{24.86} $$
where $A$ is a $(r \times k)$ matrix and $c_0$ a $(r \times 1)$ vector, each of known constants. (24.86)
imposes $r\ (\le k)$ constraints, which we take to be functionally independent, so that $A$
is of rank $r$. $H_1$ is simply the negation of $H_0$. When $r = k$, $A'A$ is non-singular and
(24.86) is equivalent to $H_0: \theta = (A'A)^{-1}A'c_0$. If $A$ is the first $r$ rows of the $(n \times k)$
matrix $X$, we also have a particularly direct $H_0$, in which we are testing the means of
the first $r$ $y_i$.
Consider the $(n \times 1)$ vector
$$ z = C(X'X)^{-1}X'y = C\hat\theta, \tag{24.87} $$
where $C$ is a $(n \times k)$ matrix and $\hat\theta$ is the Least Squares (LS) estimator of $\theta$ given at
(19.12). Then, from (24.87) and (24.85),
$$ z = C\theta + C(X'X)^{-1}X'\epsilon, $$
so that
$$ \mu = E(z) = C\theta \tag{24.88} $$
and the dispersion matrix of $z$ is, as in 19.6,
$$ V = \sigma^2 C(X'X)^{-1}C'. \tag{24.89} $$
Let us now choose $C$ so that the components of $z$ are all uncorrelated, i.e. so that
$$ V = \sigma^2 I. $$
From (24.89), this requires that
$$ C(X'X)^{-1}C' = I $$
or, if $C'C$ is non-singular, that
$$ C'C = X'X. \tag{24.90} $$

(24.90) is the condition that the $z_i$ be uncorrelated, and implies, with (24.87) and
(24.88), that
$$ (z-\mu)'(z-\mu) = \epsilon'X(X'X)^{-1}X'\epsilon = \{X(\hat\theta-\theta)\}'\{X(\hat\theta-\theta)\}. \tag{24.91} $$


24.26 We now write
$$ C = \begin{pmatrix} A \\ D \\ F \end{pmatrix}, $$
where $A$ is the $(r \times k)$ matrix in (24.86), $D$ is a $((k-r) \times k)$ matrix and $F$ is a $((n-k) \times k)$
matrix satisfying
$$ F\theta = 0. \tag{24.92} $$
Since $A$ is of rank $r$, we can choose $D$ so that the $(k \times k)$ matrix $\begin{pmatrix} A \\ D \end{pmatrix}$ is non-singular,
and thus $C$ is of rank $k$. $C'C$ is then also of rank $k$, and hence non-singular as required
above (24.90).
From (24.88), we have
$$ \mu = E(z) = \begin{pmatrix} A \\ D \\ F \end{pmatrix}\theta. \tag{24.93} $$
Thus the means of the first $r$ $z_i$ are precisely the left-hand side of (24.86), so that $H_0$
is equivalent to testing
$$ H_0: \mu_i = E(z_i) = c_{0i}, \qquad i = 1, 2, \ldots, r, \tag{24.94} $$
a composite hypothesis imposing $r$ constraints upon the parameters. Since, by (24.92),
the last $(n-k)$ of the $\mu_i$ are zero, there are $k$ non-zero parameters $\mu_i$, which together
with $\sigma^2$ make up the total of $(k+1)$ parameters.


24.27 Thus we have reduced our problem to the following terms: we have a set
of $n$ mutually uncorrelated variates $z_i$ with equal variances $\sigma^2$. $(n-k)$ of the $z_i$ have
zero means, and the others non-zero means. The hypothesis to be tested is that $r$ of
these $k$ variates have specified means. This is called the canonical form of the
general linear hypothesis.

In order to make progress with the hypothesis-testing problem, we need to make
assumptions about the distribution of the errors in the linear model (24.85): specifically,
we take each $\epsilon_i$ to be normal and hence, since they are uncorrelated, independent.
The $z_i$, being linear functions of them, will also be normally distributed and, being
uncorrelated, independently normally distributed. Their joint distribution therefore
gives the LF
$$ \begin{aligned}
L(z \mid \mu, \sigma^2) &= (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}(z-\mu)'(z-\mu)\right\} \\
&= (2\pi\sigma^2)^{-n/2}\exp\left[-\frac{1}{2\sigma^2}\left\{(z_r-\mu_r)'(z_r-\mu_r) + (z_{k-r}-\mu_{k-r})'(z_{k-r}-\mu_{k-r}) + z_{n-k}'z_{n-k}\right\}\right],
\end{aligned} \tag{24.95} $$
where suffixes to vectors denote the number of components in the vector. Our
hypothesis is
$$ H_0: \mu_r = c_0 \tag{24.96} $$
and $H_1$ is its negation.

We saw in Example 23.14 that if we have only one constraint ($r = 1$), there is
a UMPU test of $H_0$ against $H_1$, as is otherwise obvious in our present application from
the fact that we are then testing the mean of a single normal population with unknown
variance: the UMPU test is, as we saw in Example 23.14, the ordinary "equal-tails"
"Student's" t-test for this hypothesis.

Kolodzieczyk (1935), to whom the first general results concerning the linear
hypothesis are due, demonstrated the impossibility of a UMP test with more than one
constraint, and showed that there is a pair of one-sided UMP similar tests when $r = 1$:
these are the one-sided "Student's" t-tests (cf. Example 23.7). We have just seen
that there is a two-sided "Student's" t-test which is UMPU for $r = 1$, but the critical
region of this test is different according to which of the $\mu_i$ is being tested: thus there
is no common UMPU critical region for $r > 1$.

Since there is no "optimum" test in any sense we have so far discussed, we are
tempted to use the LR method to give an intuitively reasonable test.


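24.28 The derivation of the LR statistic is simple enough. The unconditional
maximum of (24.95) is obtained by solving the set of equations
$$ \frac{\partial\log L}{\partial\mu_i} = 0, \quad i = 1, 2, \ldots, k, \qquad \frac{\partial\log L}{\partial(\sigma^2)} = 0. $$
The ML estimators thus obtained are
$$ \hat\mu_i = z_i, \quad i = 1, 2, \ldots, k, \qquad \hat\sigma^2 = \frac1n\sum_{i=k+1}^{n} z_i^2, $$
whence
$$ (z-\hat\mu)'(z-\hat\mu) = n\hat\sigma^2. $$
Thus the unconditional maximum of the LF is
$$ L(z \mid \hat\mu, \hat\sigma^2) = (2\pi\hat\sigma^2 e)^{-n/2} = \left(\frac{2\pi e}{n}\sum_{i=k+1}^{n} z_i^2\right)^{-n/2}. \tag{24.97} $$
When the hypothesis (24.96) holds, the ML estimators of the unspecified parameters
are
$$ \hat{\hat\mu}_i = z_i, \quad i = r+1, r+2, \ldots, k, \qquad \hat{\hat\sigma}^2 = \frac1n\left\{\sum_{i=k+1}^{n} z_i^2 + \sum_{i=1}^{r}(z_i-c_{0i})^2\right\}, $$
whence
$$ (z-\hat{\hat\mu})'(z-\hat{\hat\mu}) = n\hat{\hat\sigma}^2, $$
so that the conditional maximum of the LF is
$$ L(z \mid c_0, \hat{\hat\mu}, \hat{\hat\sigma}^2) = (2\pi\hat{\hat\sigma}^2 e)^{-n/2} = \left[\frac{2\pi e}{n}\left\{\sum_{i=k+1}^{n} z_i^2 + \sum_{i=1}^{r}(z_i-c_{0i})^2\right\}\right]^{-n/2}. \tag{24.98} $$
From (24.97) and (24.98) the LR statistic $l$ is given by
$$ l^{2/n} = \frac{\hat\sigma^2}{\hat{\hat\sigma}^2} = \frac{1}{1+W}, \tag{24.99} $$
where
$$ W = \frac{\sum_{i=1}^{r}(z_i-c_{0i})^2}{\sum_{i=k+1}^{n} z_i^2}. \tag{24.100} $$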


It will be observed from above that $n\hat{\hat\sigma}^2$, $n\hat\sigma^2$ are respectively the minima of
$(z-\mu)'(z-\mu)$ with respect to $\mu$ under $H_0$ and $H_1$. By (24.91), these are the same
as the minima of
$$ R = \{X(\hat\theta-\theta)\}'\{X(\hat\theta-\theta)\} $$
with respect to $\theta$. The identity in $\theta$
$$ S = \epsilon'\epsilon = (y-X\theta)'(y-X\theta) = (y-X\hat\theta)'(y-X\hat\theta) + R $$
is easily verified by direct expansion, and the term on its right
$$ (y-X\hat\theta)'(y-X\hat\theta) $$
does not depend on $\theta$. Minimization of $R$ with respect to $\theta$ is therefore equivalent
to minimization of $S$. But the process of minimizing $S$ for $\theta$ is precisely the means
by which we arrived at the Least Squares solution in 19.4. To obtain $\hat{\hat\sigma}^2$, $\hat\sigma^2$ in (24.100),
therefore, we minimize $S$ in the original model under $H_0$ and $H_1$ respectively.

Since $l$ is a monotone decreasing function of $W$, the LR test is equivalent to rejecting
$H_0$ when $W$ is large. If we divide the numerator and denominator of $W$ by $\sigma^2$, we
see that when $H_0$ holds, $W$ is the ratio of the sum of squares of $r$ independent normal
variates to an independent sum of squares of $(n-k)$ such variates, i.e. is the ratio of
two independent $\chi^2$ variates with $r$ and $(n-k)$ degrees of freedom. Thus, when $H_0$
holds, $F = \dfrac{n-k}{r}W$ is distributed in the variance-ratio distribution (cf. 16.15) with
$(r, n-k)$ degrees of freedom and the LR test is carried out in terms of $F$, large values
forming the critical region.
Many of the standard tests in statistics may be reduced to tests of a linear hypothesis,
and we shall be encountering them frequently in later chapters.


Example 24.8 
As a special case of particular importance, consider the hypothesis
$$ H_0: \theta_r = 0, $$
where $\theta_r$ is a $(r \times 1)$ subvector of $\theta$ in (24.85). We may therefore rewrite (24.85) as
$$ y = (X_1 \;\; X_2)\begin{pmatrix} \theta_r \\ \theta_{k-r} \end{pmatrix} + \epsilon, $$
where $X_1$ is of order $(n \times r)$ and $X_2$ is of order $(n \times (k-r))$. Then $H_0$ becomes
equivalent to specifying
$$ y = X_2\theta_{k-r} + \epsilon. $$
In accordance with 24.28, we find the minima of $S = \epsilon'\epsilon$. Since we are here
estimating all the parameters of a linear model both on $H_0$ and on $H_1$, we may use the
result of 19.9. We have at once that the minimum under $H_0$ is
$$ n\hat{\hat\sigma}^2 = y'\{I - X_2(X_2'X_2)^{-1}X_2'\}y $$
while under $H_1$ it is
$$ n\hat\sigma^2 = y'\{I - X(X'X)^{-1}X'\}y $$
where $X = (X_1 \;\; X_2)$. The statistic
$$ F = \frac{n-k}{r}\left(\frac{\hat{\hat\sigma}^2-\hat\sigma^2}{\hat\sigma^2}\right) $$
is distributed in the variance-ratio distribution with $(r, n-k)$ degrees of freedom, the
critical region for $H_0$ being the $100\alpha$ per cent largest values of $F$.
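
The simplest instance of Example 24.8 may help to fix ideas. The sketch below is an added illustration, not from the original text: it takes the straight-line model $y_j = \theta_1 x_j + \theta_2 + \epsilon_j$ (so $k = 2$), tests $H_0: \theta_1 = 0$ (so $r = 1$), and forms $F$ from the two minimized sums of squares. The function name and the data are hypothetical.

```python
def f_statistic_slope(x, y):
    """F = ((n-k)/r) * (RSS_H0 - RSS_H1) / RSS_H1 for H0: slope = 0.

    Model: y_j = slope * x_j + intercept + error, so k = 2 and r = 1.
    """
    n = len(x)
    k, r = 2, 1
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx                 # LS estimators under H1
    intercept = ybar - slope * xbar
    # Minimum of S under H0: only the constant term is fitted.
    rss_h0 = sum((yi - ybar) ** 2 for yi in y)
    # Minimum of S under H1: the full straight line is fitted.
    rss_h1 = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return (n - k) / r * (rss_h0 - rss_h1) / rss_h1

# Hypothetical, nearly linear data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2]
F = f_statistic_slope(x, y)
```

With $r = 1$ the statistic is the square of the usual "Student's" $t$ for the slope, in agreement with the remark in 24.27 that the single-constraint case reduces to a t-test; here $F$ is referred to the variance-ratio distribution with $(1, 4)$ degrees of freedom.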


24.29 In 24.28 we saw that the LR test is based on the statistic (24.100) which
may be rewritten
$$ W = \frac{\sum_{i=1}^{r}(z_i-c_{0i})^2/\sigma^2}{\sum_{i=k+1}^{n} z_i^2/\sigma^2}. $$
Whether $H_0$ holds or not, the denominator of $W$ is distributed like $\chi^2$ with $(n-k)$
degrees of freedom. When $H_0$ holds, as we saw, the numerator is also a $\chi^2$ variate
with $r$ degrees of freedom, but when $H_0$ does not hold this is no longer so: in fact,
the numerator will always be a non-central $\chi^2$ variate (cf. 24.4) with $r$ degrees of freedom
and non-central parameter
$$ \lambda = \sum_{i=1}^{r}(\mu_i-c_{0i})^2/\sigma^2 = (\mu_r-c_0)'(\mu_r-c_0)/\sigma^2, \tag{24.101} $$
where $\mu_i$ is the true mean of $z_i$. Only when $H_0$ holds is $\lambda$ equal to zero, giving the
central $\chi^2$ distribution of 24.28. Since we wish to investigate the distribution of $W$
(or equivalently of $F$) when $H_0$ is not true, so that we can evaluate the power of the
LR test, we are led to the study of the ratio of a non-central to a central $\chi^2$ variate.
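The non-central F distribution

24.30 Consider first the ratio of two variates $z_1$, $z_2$, independently distributed
in the non-central $\chi^2$ form (24.18) with degrees of freedom $\nu_1$, $\nu_2$ and non-central
parameters $\lambda_1$, $\lambda_2$ respectively. Using (11.74) the distribution of $u = z_1/z_2$ is given by
$$ \begin{aligned}
dH(u) = du\int_0^\infty &\frac{e^{-\frac12(uv+\lambda_1)}(uv)^{\frac12\nu_1-1}}{2^{\frac12\nu_1}\Gamma\{\tfrac12(\nu_1-1)\}\Gamma(\tfrac12)}\sum_{r=0}^{\infty}\frac{\lambda_1^r(uv)^r}{(2r)!}B\{\tfrac12(\nu_1-1), \tfrac12+r\} \\
&\times\frac{e^{-\frac12(v+\lambda_2)}v^{\frac12\nu_2-1}}{2^{\frac12\nu_2}\Gamma\{\tfrac12(\nu_2-1)\}\Gamma(\tfrac12)}\sum_{s=0}^{\infty}\frac{\lambda_2^s v^s}{(2s)!}B\{\tfrac12(\nu_2-1), \tfrac12+s\}\,v\,dv.
\end{aligned} $$
If we write $\lambda = \lambda_1+\lambda_2$, $\nu = \nu_1+\nu_2$ and simplify, this becomes
$$ \begin{aligned}
dH(u) = du\,\frac{e^{-\frac12\lambda}}{2^{\frac12\nu}}\sum_{r=0}^{\infty}\sum_{s=0}^{\infty}&\frac{\lambda_1^r\lambda_2^s}{(2r)!\,(2s)!}\,\frac{\Gamma(\tfrac12+r)\Gamma(\tfrac12+s)}{\{\Gamma(\tfrac12)\}^2\,\Gamma(\tfrac12\nu_1+r)\Gamma(\tfrac12\nu_2+s)}\,u^{\frac12\nu_1+r-1} \\
&\times\left\{\int_0^\infty e^{-\frac12 v(1+u)}v^{\frac12\nu+r+s-1}\,dv\right\}.
\end{aligned} \tag{24.102} $$
The integral in (24.102) is equal to $\Gamma(\tfrac12\nu+r+s)\left(\dfrac{2}{1+u}\right)^{\frac12\nu+r+s}$. Thus
$$ dH(u) = e^{-\frac12\lambda}\sum_{r=0}^{\infty}\sum_{s=0}^{\infty}\frac{\lambda_1^r\lambda_2^s}{(2r)!\,(2s)!}\,\frac{\Gamma(\tfrac12+r)\Gamma(\tfrac12+s)\,2^{r+s}}{\{\Gamma(\tfrac12)\}^2\,B(\tfrac12\nu_1+r, \tfrac12\nu_2+s)}\,u^{\frac12\nu_1+r-1}\left(\frac{1}{1+u}\right)^{\frac12\nu+r+s}du. \tag{24.103} $$
Since
$$ \frac{\Gamma(\tfrac12+r)\Gamma(\tfrac12+s)}{(2r)!\,(2s)!\,\{\Gamma(\tfrac12)\}^2} = \frac{1}{r!\,s!\,2^{2(r+s)}}, $$
(24.103) may finally be simplified to
$$ dH(u) = e^{-\frac12\lambda}\sum_{r=0}^{\infty}\sum_{s=0}^{\infty}\frac{(\tfrac12\lambda_1)^r(\tfrac12\lambda_2)^s}{r!\,s!}\,\frac{u^{\frac12\nu_1+r-1}}{(1+u)^{\frac12\nu+r+s}}\,\frac{du}{B(\tfrac12\nu_1+r, \tfrac12\nu_2+s)}, \tag{24.104} $$
a result obtained by Tang (1938) and studied by Price (1964).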


24.31 To obtain the distribution of the ratio of a non-central $\chi^2$ variate divided by its degrees
of freedom to an independent central $\chi^2$ variate similarly divided, we put
$$ \lambda_2 = 0, \qquad \lambda_1 = \lambda, \qquad F' = \frac{z_1/\nu_1}{z_2/\nu_2} = \frac{\nu_2}{\nu_1}u $$
in (24.104). The result is
$$ dG(F') = e^{-\frac12\lambda}\sum_{r=0}^{\infty}\frac{(\tfrac12\lambda)^r}{r!}\,\frac{(\nu_1/\nu_2)^{\frac12\nu_1+r}}{B(\tfrac12\nu_1+r, \tfrac12\nu_2)}\,\frac{(F')^{\frac12\nu_1+r-1}}{\left(1+\dfrac{\nu_1}{\nu_2}F'\right)^{\frac12(\nu_1+\nu_2)+r}}\,dF'. \tag{24.105} $$
(24.105) is a generalization of the variance-ratio ($F$) distribution (16.24), to which it
reduces when $\lambda = 0$. It is called the non-central $F$ distribution with degrees of
freedom $\nu_1, \nu_2$ and non-central parameter $\lambda$. We sometimes write it $F'(\nu_1, \nu_2, \lambda)$. Like
(24.18), it was first discussed by Fisher (1928a), and it has been studied by Wishart
(1932), Tang (1938), Patnaik (1949), Price (1964), and Tiku (1965a).


The power function of the LR test of the linear hypothesis 


24.32 It follows at once from 24.28-9 and 24.31 that the power function of the
LR test of the general linear hypothesis is
$$ P = \int_{F'_\alpha(\nu_1, \nu_2, 0)}^{\infty} dF'(\nu_1, \nu_2, \lambda), \tag{24.106} $$
where $F'_\alpha(\nu_1, \nu_2, 0)$ is the $100(1-\alpha)$ per cent point of the distribution, $\nu_1 = r$, $\nu_2 = n-k$, and
$\lambda$ is defined at (24.101).

Several tables and charts of $P$ have been constructed:

(1) Tang (1938) gives the complement of $P$ (i.e. the Type II error) for test sizes $\alpha$ = 0.01,
0.05; $\nu_1$ (his $f_1$) = 1(1)8; $\nu_2$ (his $f_2$) = 2(2)6(1)30, 60, $\infty$; and
$\phi = \{\lambda/(\nu_1+1)\}^{1/2}$ = 1(0.5)3(1)8. These tables are reproduced in Mann (1949)
and in Kempthorne (1952).

(2) Lehmer (1944) gives inverse tables of $\phi$ for $\alpha$ = 0.01, 0.05; $\nu_1$ = 1(1)10, 12, 15, 20,
24, 30, 40, 60, 80, 120, $\infty$; $\nu_2$ = 2(2)20, 24, 30, 40, 60, 80, 120, 240, $\infty$; and
$P$ (her $\beta$) = 0.7, 0.8.

(3) E. S. Pearson and Hartley (1951) give eight charts of the power function, one for each
value of $\nu_1$ from 1 to 8. Each chart shows the power for $\nu_2$ = 6(1)10, 12, 15, 20,
30, 60, $\infty$; $\alpha$ = 0.01, 0.05; and $\phi$ ranging from 1 (except when $\nu_1$ = 1, when $\phi$
ranges from 2 for $\alpha$ = 0.01, 1.2 for $\alpha$ = 0.05) to a value large enough for the power
to be equal to at least 0.98. The table for $\nu_1$ = 1 is reproduced in the Biometrika
Tables.

(4) M. Fox (1956) gives inverse charts, one for each of the combinations of $\alpha$ = 0.01, 0.05,
with power $P$ (his $\beta$) = 0.5, 0.7, 0.8, 0.9. Each chart shows, for
$\nu_1$ = 3(1)10(2)20(20)100, 200, $\infty$; $\nu_2$ = 4(1)10(2)20(20)100, 200, $\infty$,
the contours of constant $\phi$. He also gives a nomogram for each $\alpha$ to facilitate
interpolation in $\beta$.

(5) A. J. Duncan (1957) gives two charts, one for $\alpha$ = 0.01 and one for $\alpha$ = 0.05. Each
shows, for $\nu_2$ = 6(1)10, 12, 15, 20(10)40, 60, $\infty$, the values of $\nu_1$ (ranging from
1 to 8) and $\phi$ required to attain power $P = 1-\beta$ = 0.50 and 0.90.


Approximation to the power function of the LR test 

24.33 As will be seen from the form of (24.105), the computation of the exact power
function (24.106) is a laborious matter, and even now its tabulation is far from complete.
However, we may obtain a simple approximation to the power function in the manner
of, and using the results of, our approximation to the non-central $\chi^2$ distribution in
24.5. If $z_1$ is a non-central $\chi^2$ variate with $\nu_1$ degrees of freedom and non-central
parameter $\lambda$, we have from (24.21) that $z_1\Big/\left(\dfrac{\nu_1+2\lambda}{\nu_1+\lambda}\right)$ is approximately a central $\chi^2$
variate with degrees of freedom $\nu^* = (\nu_1+\lambda)^2/(\nu_1+2\lambda)$. Thus
$$ z_1\Big/\left\{\left(\frac{\nu_1+2\lambda}{\nu_1+\lambda}\right)\nu^*\right\} = z_1/(\nu_1+\lambda) $$
is approximately a central $\chi^2$ variate divided by its degrees of freedom $\nu^*$. Hence
$z_1/\nu_1$ is approximately a multiple $(\nu_1+\lambda)/\nu_1$ of such a variate. If we now define the
non-central $F'$-variate
$$ F' = \frac{z_1/\nu_1}{z_2/\nu_2}, $$
where $z_2$ is a central $\chi^2$ variate with $\nu_2$ degrees of freedom, it follows at once that
approximately
$$ F' = \frac{\nu_1+\lambda}{\nu_1}F, \tag{24.107} $$
where $F$ is a central $F$-variate with degrees of freedom $\nu^* = (\nu_1+\lambda)^2/(\nu_1+2\lambda)$ and $\nu_2$.

The simple approximation (24.107) is surprisingly effective. By making comparisons
with Tang's (1938) exact tables, Patnaik shows that the power function calculated
by use of (24.107) is generally accurate to two significant figures; it will therefore
suffice for all practical purposes.

To calculate the power of the LR test of the linear hypothesis, we therefore replace
(24.106) by the approximate central $F$-integral
$$ P = \int_{\frac{\nu_1}{\nu_1+\lambda}F'_\alpha(\nu_1, \nu_2, 0)}^{\infty} dF\left\{\frac{(\nu_1+\lambda)^2}{\nu_1+2\lambda}, \nu_2\right\}, \tag{24.108} $$
the size of the test being determined by putting $\lambda = 0$. $(\nu_1+\lambda)^2/(\nu_1+2\lambda)$ is generally
fractional, and interpolation is necessary. Even the central $F$ distribution, however,
is not yet so very well tabulated as to make the accurate evaluation of (24.108) easy:
see the list of tables in 16.19.

A three-moment central $F$-approximation due to Tiku (1965a, 1966) is even more
accurate.
Dar (1962) gives a simple normal approximation to the distribution of the ratio of
two independent identical non-central $F$-variables.
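
Where tables are not to hand, both the exact power and the central $F$ approximation can be evaluated numerically. The sketch below is our own illustration, not part of the original text. It assumes the exact power is summed as a Poisson-weighted series of central beta tail areas, which is what term-by-term integration of (24.105) gives, and it uses a standard continued-fraction evaluation of the incomplete beta function in place of tables. All function names are hypothetical.

```python
import math

def _betacf(a, b, x, max_iter=300, eps=1e-13):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b

def f_upper_tail(f, nu1, nu2):
    """P(F > f) for a central F variate with (nu1, nu2) degrees of freedom."""
    return 1.0 - betainc(0.5 * nu1, 0.5 * nu2, nu1 * f / (nu1 * f + nu2))

def f_critical(alpha, nu1, nu2):
    """100(1 - alpha) per cent point of central F, found by bisection."""
    lo, hi = 0.0, 1.0
    while f_upper_tail(hi, nu1, nu2) > alpha:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f_upper_tail(mid, nu1, nu2) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def exact_power(alpha, nu1, nu2, lam, terms=200):
    """Power (24.106), summing the series (24.105) term by term."""
    fc = f_critical(alpha, nu1, nu2)
    x = nu1 * fc / (nu1 * fc + nu2)
    weight = math.exp(-0.5 * lam)      # Poisson weight for r = 0
    total = 0.0
    for r in range(terms):
        total += weight * (1.0 - betainc(0.5 * nu1 + r, 0.5 * nu2, x))
        weight *= 0.5 * lam / (r + 1)
    return total

def approximate_power(alpha, nu1, nu2, lam):
    """Power from the central F approximation (24.108)."""
    fc = f_critical(alpha, nu1, nu2)
    nu_star = (nu1 + lam) ** 2 / (nu1 + 2.0 * lam)
    return f_upper_tail(nu1 / (nu1 + lam) * fc, nu_star, nu2)

p_exact = exact_power(0.05, 3, 20, 6.0)
p_approx = approximate_power(0.05, 3, 20, 6.0)
```

Setting `lam = 0` in either function returns the size $\alpha$, as it should; comparing `p_exact` with `p_approx` for a few parameter values gives a direct feel for the accuracy claimed for (24.107).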


The non-central t-distribution 

24.34 When $\nu_1 = 1$, the non-central $F$ distribution (24.105) reduces to the
non-central $t^2$ distribution, just as for the central distributions (cf. 16.15). If we
transform from $t^2$ to $t$, we obtain the non-central t-distribution, which we call the
$t'$-distribution. Evidently, from the derivation of non-central $\chi^2$ as the sum of non-central
squared normal variates, we may write
$$ t' = (z+\delta)/w^{1/2}, \tag{24.109} $$
where $z$ is a normal variate with zero mean and $w$ is independently distributed like $\chi^2/f$
with $f$ degrees of freedom (we write $f$ instead of $\nu_2$ in (24.105), and $\delta^2 = \lambda$, in this case).
Our discussion of the $F'$ distribution covers the $t'^2$ distribution, but the $t'$ distribution
has received special attention because of its importance in applications.

Johnson and Welch (1939) studied the distribution and gave tables for finding
$100(1-\alpha)$ per cent points of the distribution of $t'$ for $\alpha$ or $1-\alpha$ = 0.005, 0.01, 0.025,
0.05, 0.1(0.1)0.5, $f$ = 4(1)9, 16, 36, 144, $\infty$, and any $\delta$; and conversely for finding $\delta$
for given values of $t'$. Resnikoff (1962) gives additional tables.

Resnikoff and Lieberman (1957) have given tables of the frequency function and the
distribution function of $t'$ to 4 d.p., at intervals of 0.05 for $t/f^{1/2}$, for $f$ = 2(1)24(5)49,
and for the values of $\delta$ defined by
$$ \int_{\delta/(f+1)^{1/2}}^{\infty}(2\pi)^{-1/2}\exp(-\tfrac12 x^2)\,dx = a, $$
$a$ = 0.001, 0.0025, 0.004, 0.01, 0.025, 0.04, 0.065, 0.10, 0.15, 0.25. They, and also Scheuer
and Spurgeon (1963), give some percentage points of the distributions. Locks et al. (1963)
give similar tables at intervals of 0.2 for $t'$, with $f$ = 1(1)20(5)40 and $\delta$ defined by
$\delta(f+1)^{-1/2}$ or $\delta(f+2)^{-1/2}$ = 0(0.25)3. Owen (1963) gives very extensive tables of
percentage points. Hogben et al. (1961) give a method of obtaining the moments, with
tables for the first four. Amos (1964) studies series approximations of the distribution.
Krishnan (1967) studies the moments of (24.109) derived from (24.104) instead of
(24.105) (so that $w$ is a non-central $\chi^2$) and approximates its distribution.


24.35 A particularly important application of the $t'$ distribution is in evaluating the
power of a "Student's" t-test for which the critical region is in one tail only (the
"equal-tails" case, of course, corresponds to the $t'^2$ distribution). The test is that $\delta = 0$ in
(24.109), the critical region being determined from the central t-distribution. Its power
is evidently just the integral of the non-central t-distribution over the critical region. It
has been specially tabulated by Neyman et al. (1935), who give, for $\alpha$ = 0.05, 0.01,
$f$ (their $m$) = 1(1)30, $\infty$ and $\delta$ (their $\rho$) = 1(1)10, tables and charts of the complement
$1-P$ of the power of the test, together with the values of $\delta$ for which $P = 1-\alpha$. Neyman
and Tokarska (1936) give inverse tables of $\delta$ for the same values of $\alpha$ and $f$ and
$1-P$ = 0.05, 0.10(0.10)0.90. Owen (1965) gives 5 d.p. tables of $\delta$ for $\alpha$ = 0.05,
0.025, 0.01 and 0.005; $m-1$ = 1(1)30(5)100(10)200, $\infty$ and $1-P$ = 0.01, 0.05,
0.10(0.10)0.90.
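
The power of the one-tailed test can also be estimated by direct simulation of (24.109). The sketch below is an added illustration under stated assumptions: $z$ is taken as a standardized normal variate, $w$ as $\chi^2_f/f$ generated from a Gamma variate, and the critical value 1.812 is the upper 5 per cent point of the central t-distribution with 10 degrees of freedom. The function names, seed and number of draws are arbitrary choices.

```python
import math
import random

def simulate_t_prime(delta, f, rng):
    """Draw one value of t' = (z + delta) / sqrt(w), as in (24.109)."""
    z = rng.gauss(0.0, 1.0)                    # standardized normal variate
    w = rng.gammavariate(0.5 * f, 2.0) / f     # chi-square with f d.f., over f
    return (z + delta) / math.sqrt(w)

def one_sided_power(delta, f, t_crit, n_draws=20000, seed=12345):
    """Monte Carlo estimate of P(t' > t_crit) for given delta and f."""
    rng = random.Random(seed)
    hits = sum(simulate_t_prime(delta, f, rng) > t_crit for _ in range(n_draws))
    return hits / n_draws

T_CRIT = 1.812   # upper 5 per cent point of central t, 10 degrees of freedom

power_at_null = one_sided_power(0.0, 10, T_CRIT)   # should be close to 0.05
power_at_two = one_sided_power(2.0, 10, T_CRIT)    # power when delta = 2
```

A simulation of this kind is no substitute for the tables cited, but it gives a quick check on any value read from them.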


Optimum properties of the LR test of the general linear hypothesis 


24.36 We saw in 24.27 that, apart from the case r = 1, there is no UMPU test 
of the general linear hypothesis. Nevertheless, the LR test of that hypothesis has 
certain optimum properties which we now proceed to develop, making use of simplified 
proofs due to Wolfowitz (1949) and Lehmann (1950). 

In 24.28 we derived the ML estimators of the $(k-r+1)$ unspecified parameters
when $H_0$ holds. They are the components of
$$ t = (\hat{\hat\mu}_{k-r}, \hat{\hat\sigma}^2), $$
which are defined above (24.98). When $H_0$ holds, the components of $t$ are a set of
$(k-r+1)$ sufficient statistics for the unspecified parameters. By 23.10, their
distribution is complete. Thus, by 23.19, every similar size-$\alpha$ critical region $w$ for $H_0$ will
consist of a fraction $\alpha$ of every surface $t$ = constant. Here every component of $t$ is
to be constant, and in particular the component $\hat{\hat\sigma}^2$. Let
$$ n\hat{\hat\sigma}^2 = \sum_{i=k+1}^{n} z_i^2 + \sum_{i=1}^{r}(z_i-c_{0i})^2 = a^2, \tag{24.110} $$
where $a$ is a constant.

Now consider a fixed value of $\lambda$, defined at (24.101), say $\lambda = d^2 > 0$. The power
of any similar region on this surface will consist of the aggregate of its power on
(24.110) for all $a$. For fixed $a$, the power on the surface $\lambda = d^2$ is
$$ P(w \mid \lambda, a) = \int_{\lambda = d^2} L(z \mid \mu, \sigma^2)\,dz, \tag{24.111} $$
where $L$ is the LF defined at (24.95). We may write this out fully as
$$ \begin{aligned}
P(w \mid \lambda, a) = (2\pi\sigma^2)^{-n/2}\int\exp\Big[-\frac{1}{2\sigma^2}\Big\{&\{(z_r-c_0)-(\mu_r-c_0)\}'\{(z_r-c_0)-(\mu_r-c_0)\} \\
&+ (z_{k-r}-\mu_{k-r})'(z_{k-r}-\mu_{k-r}) + z_{n-k}'z_{n-k}\Big\}\Big]\,dz.
\end{aligned} \tag{24.112} $$
Using (24.110) and (24.101), (24.112) becomes
$$ P(w \mid \lambda, a) = (2\pi\sigma^2)^{-\frac12(n-k+r)}\exp\left\{-\tfrac12\left(\frac{a^2}{\sigma^2}+\lambda\right)\right\}\int\exp\{(z_r-c_0)'(\mu_r-c_0)/\sigma^2\}\,dz_r, \tag{24.113} $$
the vector $z_{k-r}$ having been integrated out over its whole range since its distribution
is free of $\lambda$ and independent of $a$. The only non-constant factor in (24.113) is the integral,
which is to be maximized to obtain the critical region $w$ with maximum $P$. The
integral is over the surface $\lambda = d^2$ or $(\mu_r-c_0)'(\mu_r-c_0)$ = constant. It is clearly
a monotone increasing function of $|z_r-c_0|$, i.e. of $(z_r-c_0)'(z_r-c_0) = \sum_{i=1}^{r}(z_i-c_{0i})^2$.
Now if $\sum_{i=1}^{r}(z_i-c_{0i})^2$ is maximized for fixed $a$ in (24.110), $W$ defined at (24.100) is
also maximized. Thus for any fixed $\lambda$ and $a$, the maximum value of $P(w \mid \lambda, a)$ is
attained when $w$ consists of large values of $W$. Since this holds for each $a$, it holds
when the restriction that $a$ be fixed is removed. We have therefore established that
on any surface $\lambda = d^2 > 0$, the LR test, which consists of rejecting large values of $W$,
has maximum power, a result due to Wald (1942).

An immediate consequence is P. L. Hsu's (1941) result, that the LR test is UMP
among all tests whose power is a function of $\lambda$ only.


Invariant tests 

24.37 In developing unbiassed tests for location parameters in 24.20, we found it
quite natural to introduce the invariance condition (24.61) as a necessary condition
which any acceptable test must satisfy. Similarly for scale parameters in 24.21, the
logarithmic transformation from (24.68) to (24.69) requires implicitly that the test
statistic $t$ satisfies
$$ t(y_1, y_2, \ldots, y_n) = t(cy_1, cy_2, \ldots, cy_n), \qquad c > 0. \tag{24.114} $$

Frequently, it is reasonable to restrict the class of tests considered to those which are 
invariant under transformations which leave the hypothesis to be tested invariant ; 
if this is not done, e.g. in the problem of testing the equality of location (or scale) para- 
meters, it would mean that a change of origin (or unit) of measurement would affect 
the conclusions reached by the test. The relationship between invariance and sufficiency 
principles in general is discussed by Hall et al. (1965), with a theorem due to C. Stein 
which gives conditions under which it does not matter in which order the principles 
are applied. 

If we examine the canonical form of the general linear hypothesis in 24.27 from 
this point of view, we see at once that the problem is invariant under : 


(a) any orthogonal transformation of $(z_r-c_0)$ (this leaves $(z_r-c_0)'(z_r-c_0)$
unchanged);

(b) any orthogonal transformation of $z_{n-k}$ (this leaves $z_{n-k}'z_{n-k}$ unchanged);

(c) the addition of any constant $a$ to each component of $z_{k-r}$ (the mean vector
of which is arbitrary);

(d) the multiplication of all the variables by $c > 0$ (which affects only the
common variance $\sigma^2$).

It is easily seen that a statistic $t$ is invariant under all the operations (a) to (d) if, and
only if, it is a function of $W = (z_r-c_0)'(z_r-c_0)/z_{n-k}'z_{n-k}$ alone. Clearly if $t$ is a
function of $W$ alone, its power function, like that of $W$, will depend only on $\lambda$. By
the last sentence of 24.36, therefore, the LR test, rejecting large values of $W$, is UMP
among invariant tests of the general linear hypothesis.
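
The invariance of $W$ under the operations (a) to (d) is readily confirmed numerically. The sketch below is an added illustration with hypothetical data and function names, for the case $r = 2$, $k - r = 1$, $n - k = 2$. Under (d) the constants $c_0$ are scaled along with the variates, which is our reading of "all the variables".

```python
import math

def w_statistic(z_r, c0, z_rest):
    """W = (z_r - c0)'(z_r - c0) / z_{n-k}' z_{n-k}; z_{k-r} does not enter."""
    numerator = sum((z - c) ** 2 for z, c in zip(z_r, c0))
    denominator = sum(z ** 2 for z in z_rest)
    return numerator / denominator

def rotate2(v, angle):
    """Apply a 2 x 2 rotation (an orthogonal transformation) to v."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [cos_a * v[0] - sin_a * v[1], sin_a * v[0] + cos_a * v[1]]

# Hypothetical canonical variates.
z_r, c0 = [1.5, -0.5], [1.0, 0.0]   # first r variates and their H0 means
z_mid = [0.3]                       # z_{k-r}: unspecified means
z_rest = [0.7, -1.2]                # z_{n-k}: zero means
w0 = w_statistic(z_r, c0, z_rest)

# (a) orthogonal transformation of (z_r - c0)
deviation = rotate2([z - c for z, c in zip(z_r, c0)], 0.8)
w_a = w_statistic([d + c for d, c in zip(deviation, c0)], c0, z_rest)

# (b) orthogonal transformation of z_{n-k}
w_b = w_statistic(z_r, c0, rotate2(z_rest, -1.1))

# (c) adding a constant to z_{k-r} cannot matter: W ignores z_mid entirely
z_mid_shifted = [z + 5.0 for z in z_mid]
w_c = w_statistic(z_r, c0, z_rest)

# (d) multiplying all the variables (and c0) by c > 0
c = 3.0
w_d = w_statistic([c * z for z in z_r], [c * x for x in c0],
                  [c * z for z in z_rest])
```

All four transformed values agree with `w0` up to rounding error.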


EXERCISES 


24.1 Show that the c.f. of the non-central $\chi^2$ distribution (24.18) is
$$ \phi(t) = (1-2it)^{-\nu/2}\exp\left\{\frac{\lambda it}{1-2it}\right\}, $$
giving cumulants $\kappa_r = (\nu+r\lambda)2^{r-1}(r-1)!$. In particular,
$$ \kappa_1 = \nu+\lambda, \qquad \kappa_2 = 2(\nu+2\lambda), \qquad \kappa_3 = 8(\nu+3\lambda), \qquad \kappa_4 = 48(\nu+4\lambda). $$
Hence show that the sum of two independent non-central $\chi^2$ variates is another such,
with both degrees of freedom and non-central parameter equal to the sum of those of
the component distributions.

(Wishart, 1932; Tang, 1938)


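24.2 Show that if the non-central normal variates $x_i$ of 24.4 are subjected to $k$
orthogonal linear constraints
$$ \sum_{i=1}^{n} a_{ij}x_i = b_j, \qquad j = 1, 2, \ldots, k, $$
where
$$ \sum_{i=1}^{n} a_{ij}^2 = 1, \qquad \sum_{i=1}^{n} a_{ij}a_{il} = 0, \quad j \neq l, $$
then
$$ y^2 = \sum_{i=1}^{n} x_i^2 - \sum_{j=1}^{k} b_j^2 $$
has the non-central $\chi^2$ distribution with $(n-k)$ degrees of freedom and non-central
parameter $\lambda = \sum_{i=1}^{n}\mu_i^2 - \sum_{j=1}^{k}\left(\sum_{i=1}^{n} a_{ij}\mu_i\right)^2$.
(Patnaik (1949). Cf. also Bateman (1949).)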


24.3 Show that for any fixed $r$, the first $r$ moments of a non-central $\chi^2$ distribution
with fixed $\lambda$ tend, as degrees of freedom increase, to the corresponding moments of the
central $\chi^2$ distribution with the same degrees of freedom. Hence show that, in testing a
hypothesis $H_0$ distinguished from the alternative hypothesis by the value of a parameter $\theta$,
if the test statistic has a non-central $\chi^2$ distribution with degrees of freedom an increasing
function of sample size $n$, and non-central parameter $\lambda$ a non-increasing function of $n$
such that $\lambda = 0$ when $H_0$ holds, then the test will become ineffective as $n \to \infty$, i.e. its
power will tend to its size $\alpha$.
24.4 Show that the LR statistic $l$ defined by (24.40) for testing the equality of $k$
normal variances has moments about zero
$$ \mu'_r = \frac{n^{rn/2}\,\Gamma\{\tfrac12(n-k)\}}{\Gamma\{\tfrac12[(r+1)n-k]\}}\prod_{i=1}^{k}\frac{\Gamma\{\tfrac12[(r+1)n_i-1]\}}{n_i^{rn_i/2}\,\Gamma\{\tfrac12(n_i-1)\}}. $$
(Neyman and Pearson, 1931)


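24.5 For testing the hypothesis $H_1$ that $k$ normal distributions are identical in mean
and variance, show that the LR statistic is, for sample sizes $n_i \ge 2$,
$$ l_1 = \prod_{i=1}^{k}\left(\frac{s_i^2}{s_0^2}\right)^{n_i/2}, $$
where
$$ s_i^2 = \frac{1}{n_i}\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_i)^2, \qquad \bar{x} = \frac1n\sum_{i=1}^{k} n_i\bar{x}_i $$
and
$$ s_0^2 = \frac1n\sum_{i=1}^{k} n_i\{s_i^2 + (\bar{x}_i-\bar{x})^2\}, $$
and that its moments about zero are
$$ \mu'_r = \frac{n^{rn/2}\,\Gamma\{\tfrac12(n-1)\}}{\Gamma\{\tfrac12[(r+1)n-1]\}}\prod_{i=1}^{k}\frac{\Gamma\{\tfrac12[(r+1)n_i-1]\}}{n_i^{rn_i/2}\,\Gamma\{\tfrac12(n_i-1)\}}. $$
(Neyman and Pearson, 1931)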


24.6 For testing the hypothesis $H_2$ that $k$ normal distributions with the same variance
have equal means, show that the LR statistic (with sample sizes $n_i \ge 2$) is
$$ l_2 = l_1/l, $$
where $l$ and $l_1$ are as defined for Exercises 24.4 and 24.5, and that the exact distribution
of $z = 1 - l_2^{2/n}$ when $H_2$ holds is
$$ dF \propto z^{\frac12(k-3)}(1-z)^{\frac12(n-k-2)}\,dz, \qquad 0 \le z \le 1. $$
Find the moments of $l_2$ and hence show that when the hypothesis $H_1$ of Exercise 24.5 holds,
$l$ and $l_2$ are independently distributed.

(Neyman and Pearson, 1931; Hogg (1961). See
also Hogg (1962) for a test of $H_2$.)


24.7 Show that when all the sample sizes $n_i$ are equal, the LR statistic $l$ of (24.40)
and its modified form $l^*$ of (24.44) are connected by the relation
$$ n\log l^* = (n-k)\log l, $$
so that in this case the tests based on $l$ and $l^*$ are equivalent.


24.8 For samples from k distributions of form (24.48) or (24.52), show that if J is 
the LR statistic for testing the hypothesis 
Hg: 9; = 0, =... = Op, 3 Op,41 = Op40 =... = Op,3 Op41 =... = Oy; 
e > On, +1 — «co — On, 
that the 6; fall into r distinct groups (not necessarily of equal size) within which they are 


equal, then —2log/ is distributed exactly like y? with 2(n—r) degrees of freedom. 
(Hogg, 1956) 


24.9 In a sample of $n$ observations from
$$ dF = \frac{dx}{2\theta}, \qquad \mu-\theta \le x \le \mu+\theta, $$
show that the LR statistic for testing $H_0: \mu = 0$ is
$$ l = \left(\frac{x_{(n)}-x_{(1)}}{2z}\right)^n = \left(\frac{R}{2z}\right)^n, $$
where $z = \max\{-x_{(1)}, x_{(n)}\}$. Using Exercise 23.7, show that $l$ and $z$ are independently
distributed, so that we have the factorization of c.f.'s
$$ E[\exp\{(-2\log R^n)it\}] = E[\exp\{(-2\log l)it\}]\,E[\exp\{[-2\log(2z)^n]it\}]. $$
Hence show that the c.f. of $-2\log l$ is $\phi(t) = \dfrac{n-1}{n(1-2it)-1}$, so that, as $n \to \infty$, $-2\log l$
is distributed like $\chi^2$ with 2 degrees of freedom.

(Hogg and Craig, 1956)


24.10 In 24.6, show that a quadratic form $x'Ax$ has a non-central $\chi^2$ distribution
if and only if $AV$ is idempotent, and that if the distribution has $n$ degrees of freedom
this implies $A = V^{-1}$.


k 
24.11 kindependent samples, of sizes mj > 2, LX nj = n, are taken from exponential 


populations 


ae? 
dF; (x) = exp {—(* = ‘)has/a 0; *% = OO. 


Show that the LR statistic for testing 


is 
ed n 
lj = Il d;‘/d" 
i=1 
where di = Xi — (x4), the difference between the mean and smallest observation in the 
ith sample, and d is the same function of the combined samples, 1.e. 
d = X% — X11): 
Show that the moments of /j/" are 
,. wT (n—1) * PF {(m—1)+pm/n} 
oe TG 0 1 neler (ng—1) * 
(P. V. Sukhatme, 1936) 


24.12 In Exercise 24.11, show that for testing

$$H_2: \sigma_1 = \sigma_2 = \dots = \sigma_k,$$

the $\theta_i$ being unspecified, the LR statistic is

$$l_2 = \prod_{i=1}^{k} d_i^{\,n_i} \Big/ \left(\frac{1}{n}\sum_{i=1}^{k} n_i d_i\right)^{n},$$

and that the moments of $l_2^{1/n}$ are

$$\mu'_p = \frac{n^p\, \Gamma(n-k)}{\Gamma(n-k+p)} \prod_{i=1}^{k} \frac{\Gamma\{(n_i-1) + p n_i/n\}}{n_i^{\,p n_i/n}\, \Gamma(n_i-1)}.$$

(P. V. Sukhatme, 1936)


24.13 In Exercise 24.11, show that if it is known that the $\sigma_i$ are all equal, the LR
statistic for testing

$$H_3: \theta_1 = \theta_2 = \dots = \theta_k$$

is

$$l_3 = l_1/l_2,$$

where $l_1$ and $l_2$ are defined in Exercises 24.11-12. Show that the exact distribution
of $l_3^{1/n} = u$ is

$$dF = \frac{1}{B(n-k,\, k-1)}\, u^{n-k-1} (1-u)^{k-2}\, du, \qquad 0 \le u \le 1,$$

and find the moments of u. Show that when $H_1$ of Exercise 24.11 holds, $l_2$ and $l_3$ are
independently distributed.
(P. V. Sukhatme, 1936; Hogg, 1961. Cf. also Hogg
and Tanis (1963) for other tests of these hypotheses.)


24.14 Show by comparison with the unbiassed test of Exercise 23.17 that the LR 
test for the hypothesis that two normal populations have equal variances is biassed for 
unequal sample sizes $n_1$, $n_2$.


24.15 Show by using (24.67) that an unbiassed similar size-$\alpha$ test of the hypothesis
$H_0$ that k independent observations $x_i$ (i = 1, 2, ..., k) from normal populations with
unit variance have equal means is given by the critical region

$$\sum_{i=1}^{k} (x_i - \bar{x})^2 \ge c_\alpha,$$

where $c_\alpha$ is the 100(1-$\alpha$) per cent point of the distribution of $\chi^2$ with (k-1) degrees of
freedom. Show that this is also the LR test.


24.16 Show that the three test statistics of Exercises 23.24-26 are equivalent to the
LR statistics in the situations given; that the critical region of the LR test in
Exercise 23.24 is not the UMPU region and is in fact biassed; but that in the other two
Exercises the LR test coincides with the UMP similar test. 


24.17 Extending the results of 23.10-13, show that if a distribution is of form

$$f(x \mid \theta_1, \theta_2, \dots, \theta_k) = Q(\theta)\, M(x) \exp\left\{\sum_{j=3}^{k} B_j(x)\, A_j(\theta_3, \theta_4, \dots, \theta_k)\right\}; \qquad a(\theta_1, \theta_2) \le x \le b(\theta_1, \theta_2)$$

(the terminals of the distribution depending only on the two parameters not entering into
the exponential term), the statistics $t_1 = x_{(1)}$, $t_2 = x_{(n)}$, $t_j = \sum_{i=1}^{n} B_j(x_i)$ are jointly
sufficient for $\theta$ in a sample of n observations, and that their distribution is complete.
(Hogg and Craig, 1956)


24.18 Using the result of Exercise 24.17, show that in independent samples of sizes
$n_1$, $n_2$, from two distributions

$$dF = \exp\left\{-\left(\frac{x_i - \theta_i}{\sigma}\right)\right\} \frac{dx_i}{\sigma}, \qquad \sigma > 0;\ x_i \ge \theta_i;\ i = 1, 2,$$

the statistics

$$z_1 = \min_i \{x_{(1)i}\},$$
$$z_2 = \sum_{j=1}^{n_1} x_{1j} + \sum_{j=1}^{n_2} x_{2j},$$
are sufficient for 6, and 8, and complete. 
Show that the LR statistic for $H_0: \theta_1 = \theta_2$ is given by

$$l^{1/(n_1+n_2)} = \frac{z_2 - (n_1 x_{(1)1} + n_2 x_{(1)2})}{z_2 - (n_1 + n_2)\, z_1},$$

and that $l$ is distributed independently of $z_1$, $z_2$ and hence of its denominator. Show
that $l$ gives an unbiassed test of $H_0$.
(Paulson, 1941) 


24.19 Generalizing the result of Exercise 16.7, show that the d.f. of the non-central
$\chi^2$ distribution (24.18) is given, for even $\nu$, by

$$H(z) = \mathrm{Prob}\{u - v \ge \tfrac{1}{2}\nu\},$$

where u and v are independent Poisson variates with parameters $\tfrac{1}{2}z$ and $\tfrac{1}{2}\lambda$ respectively.


(Johnson, 1959a) 


CHAPTER 25 


THE COMPARISON OF TESTS 


25.1 In Chapters 22-24 we have been concerned with the problems of finding
"optimum" tests, i.e. of selecting the test with the "best" properties in a given
situation, where "best" means the possession by the test of some desirable property
such as being UMP, UMPU, etc. We have not so far considered the question of 
comparing two or more tests for a given situation with the aim of evaluating their 
relative efficiencies. Some investigation of this subject is necessary to permit us to 
evaluate the loss of efficiency incurred in using any other test than the optimum one. 
It may happen, for example, that a UMP test is only very slightly more powerful than 
another test, which is perhaps much simpler to compute; in such circumstances we 
might well decide to use the less efficient test in routine testing. Before we can decide 
an issue such as this, we must make some quantitative comparison between the tests. 

We discussed the analogous problem in the theory of estimation in 17.29, where 
we derived a measure of estimating efficiency. The reader will perhaps ask how it
comes about that, whereas in the theory of estimation the measurement of efficiency 
was discussed almost as soon as the concept of efficiency had been defined, we have 
left over the question of measuring test efficiency to the end of our general discussion 
of the theory of tests. The answer is partly that the concept of test efficiency turns 
out to be more complicated than that of estimating efficiency, and therefore could not 
be so shortly treated. For the most part, however, we are simply following the historical
development of the subject: it was not until, from about 1935 onwards, the attention
of statisticians turned to the computationally simple tests to be discussed in
Chapters 31 and 32 that the need arose to measure test efficiency. Even the idea
of test consistency, which we encountered in 24.17, was not developed by Wald and
Wolfowitz (1940) until nearly twenty years after the first definition of a consistent
estimator by Fisher (1921a); only when "inefficient" tests became of practical interest
was it necessary to investigate the weaker properties of tests.


The comparison of power functions 


25.2 In testing a given hypothesis against a given alternative for fixed sample
size, the simplest way of comparing two tests is by direct examination of their power
functions. If sample size is at our disposal (e.g. in the planning of a series of observations),
it is natural to seek a definition of test efficiency of the same form as that used
for estimating efficiency in 17.29. If an "efficient" test (i.e. the most powerful in
the class considered) of size $\alpha$ needs $n_1$ observations to attain a certain
power, and a second size-$\alpha$ test requires $n_2$ observations to attain the same power
against the same alternative, we may define the relative efficiency of the second test in
attaining that power against that alternative as $n_1/n_2$. This measure is, as in the
case of estimation, the reciprocal of the ratio of sample sizes required for a given performance,
but it will be noticed that our definition of relative efficiency is not asymptotic,
and that it imposes no restriction upon the forms of the sampling distributions
of the test statistics being compared. We can compare any two tests in this way
because the power functions of the tests, from which the relative efficiency is calculated,
take comprehensive account of the distributions of the test statistics; the power
functions contain all the information relevant to our comparison.
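
As a rough numerical illustration of this definition, the sketch below finds the smallest sample size at which a one-sided size-$\alpha$ test of a normal mean attains a stated power, and then forms the ratio $n_1/n_2$. It rests on an assumption that is ours and not part of the definition: each test statistic is treated as exactly normal with mean $\theta$ and variance v/n, so that v = 1 stands for the sample mean of unit-variance observations and v = $\pi$/2 for the large-sample variance of the median. The function names are hypothetical.

```python
from math import ceil, pi
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standardized normal d.f. G and its inverse


def required_n(shift, alpha, power, v=1.0):
    """Smallest n with G(shift * sqrt(n / v) - d_alpha) >= power.

    Assumes the statistic is normal with mean theta and variance v / n.
    """
    d_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    d_power = STD_NORMAL.inv_cdf(power)
    return ceil(v * ((d_alpha + d_power) / shift) ** 2)


def relative_efficiency(shift, alpha, power, v2, v1=1.0):
    """n1 / n2: relative efficiency of test 2 (variance v2/n) against test 1."""
    return required_n(shift, alpha, power, v1) / required_n(shift, alpha, power, v2)


if __name__ == "__main__":
    # The answer depends on all three arguments: shift, size and power.
    for shift in (0.2, 0.5, 1.0):
        print(shift, relative_efficiency(shift, 0.05, 0.90, v2=pi / 2))
```

Because sample sizes are integers, the ratio moves about as the alternative, the size and the power are varied, even though the underlying variance ratio is fixed.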


Asymptotic comparisons 


25.3 The concept of relative efficiency, although comprehensive, is not concise. 
Like the power functions on which it is based, it is a function of three arguments:
the size $\alpha$ of the tests, the "distance" (in terms of some parameter $\theta$) between the
hypothesis tested and the alternative, and the sample size ($n_1$) required by the efficient
test. Even if we may confine ourselves to a few typical values of $\alpha$, a table of double
entry is still required for the comparison of tests by this measure. It would be much 
more convenient if we could find a single summary measure of efficiency, and it is 
clear that we can only hope to achieve this by imposing some limiting process. We 
have thus been brought back to the necessity for restriction to asymptotic results. 


25.4 A different approach which suggests itself is that we let sample sizes tend to 
infinity, as in 17.29, and take the ratio of the powers of the tests as our measure of test 
efficiency. If we consider this suggestion we immediately encounter a difficulty. If 
the tests we are considering are both size-$\alpha$ consistent tests against the class of alternative
hypotheses in the problem (and henceforth we shall always assume this to be so),
it follows by definition that the power function of each tends to 1 as sample size
increases. If we compare the tests against some fixed alternative value of $\theta$, it follows
that the efficiency thus defined will always tend to 1 as sample size increases. The
suggested measure of test efficiency is therefore quite useless.

More generally, it is easy to see that consideration of the power functions of consistent
tests asymptotically in n is of limited value. For example, Wald (1941) defined
an asymptotically most powerful test as one whose power function cannot be bettered
as sample size tends to infinity, i.e. which is UMP asymptotically. The following
example, due to Lehmann (1949), shows that one asymptotically UMP test may in fact
be decidedly inferior to another such test, even asymptotically. 


Example 25.1 


Consider again the problem, discussed in Examples 22.1 and 22.2, of testing the
mean $\theta$ of a normal distribution with known variance, taken to be equal to 1. We
wish to test $H_0: \theta = \theta_0$ against the one-sided alternative $H_1: \theta = \theta_1 > \theta_0$. In 22.17,
we saw that a UMP test of $H_0$ against $H_1$ is given by the critical region $\bar{x} \ge \theta_0 + d_\alpha/n^{1/2}$,
and in Example 22.3 that its power function is

$$P_1 = G\{\Delta n^{1/2} - d_\alpha\} = 1 - G\{d_\alpha - \Delta n^{1/2}\}, \tag{25.1}$$

where $\Delta = \theta_1 - \theta_0$ and the fixed value $d_\alpha$ defines the size $\alpha$ of the test as at (22.16).

We now construct a two-tailed size-$\alpha$ test, rejecting $H_0$ when

$$\bar{x} \ge \theta_0 + d_{\alpha_1}/n^{1/2} \quad \text{or} \quad \bar{x} \le \theta_0 - d_{\alpha_2}/n^{1/2},$$

where $\alpha_1$ and $\alpha_2$, functions of n, may be chosen arbitrarily subject to the condition
$\alpha_1 + \alpha_2 = \alpha$, which implies that $d_{\alpha_1}$ and $d_{\alpha_2}$ both exceed $d_\alpha$. (23.48) shows that the
power function of this second test is

$$P_2 = G\{\Delta n^{1/2} - d_{\alpha_1}\} + G\{-\Delta n^{1/2} - d_{\alpha_2}\}, \tag{25.2}$$

and since G is always positive, it follows that

$$P_2 > G\{\Delta n^{1/2} - d_{\alpha_1}\} = 1 - G\{d_{\alpha_1} - \Delta n^{1/2}\}. \tag{25.3}$$

Since the first test is UMP, we have, from (25.1) and (25.3),

$$G\{d_{\alpha_1} - \Delta n^{1/2}\} - G\{d_\alpha - \Delta n^{1/2}\} > P_1 - P_2 \ge 0. \tag{25.4}$$

It is easily seen that the difference between $G\{x\}$ and $G\{y\}$ for fixed (x-y) is maximized
when x and y are symmetrically placed about zero, i.e. when x = -y, i.e. that

$$G\{\tfrac12(x-y)\} - G\{-\tfrac12(x-y)\} \ge G\{x\} - G\{y\}. \tag{25.5}$$

Applying (25.5) to (25.4), we have

$$G\{\tfrac12(d_{\alpha_1} - d_\alpha)\} - G\{-\tfrac12(d_{\alpha_1} - d_\alpha)\} > P_1 - P_2 \ge 0. \tag{25.6}$$

Thus if we choose $\alpha_1$ for each n so that

$$\lim_{n \to \infty} d_{\alpha_1} = d_\alpha, \tag{25.7}$$

the left-hand side of (25.6) will tend to zero, whence $P_1 - P_2$ will tend to zero uniformly
in $\Delta$. The two-tailed test will therefore be asymptotically UMP.
Now consider the ratio of Type II errors of the tests. From (25.1) and (25.2),
we have

$$\frac{1 - P_2}{1 - P_1} = \frac{G\{d_{\alpha_1} - \Delta n^{1/2}\} - G\{-d_{\alpha_2} - \Delta n^{1/2}\}}{G\{d_\alpha - \Delta n^{1/2}\}}. \tag{25.8}$$

As $n \to \infty$, numerator and denominator of (25.8) tend to zero. Using L'Hôpital's
rule, we find, using a prime to denote differentiation with respect to $n^{1/2}$ and writing g
for the normal f.f.,

$$\lim_{n\to\infty} \frac{1 - P_2}{1 - P_1} = \lim_{n\to\infty} \frac{(d'_{\alpha_1} - \Delta)\, g\{d_{\alpha_1} - \Delta n^{1/2}\}}{-\Delta\, g\{d_\alpha - \Delta n^{1/2}\}} + \lim_{n\to\infty} \frac{(d'_{\alpha_2} + \Delta)\, g\{-d_{\alpha_2} - \Delta n^{1/2}\}}{-\Delta\, g\{d_\alpha - \Delta n^{1/2}\}}. \tag{25.9}$$

Now (25.7) implies that $d_{\alpha_2} \to \infty$ with n, and therefore that the second term on the
right of (25.9) tends to zero: (25.7) also implies that the first term on the right of
(25.9) will tend to infinity if

$$\lim_{n\to\infty} \frac{g\{d_{\alpha_1} - \Delta n^{1/2}\}}{g\{d_\alpha - \Delta n^{1/2}\}} = \lim_{n\to\infty} \frac{\exp\{-\tfrac12(d_{\alpha_1} - \Delta n^{1/2})^2\}}{\exp\{-\tfrac12(d_\alpha - \Delta n^{1/2})^2\}} = \lim_{n\to\infty} \exp\{-\tfrac12(d_{\alpha_1}^2 - d_\alpha^2) + \Delta n^{1/2}(d_{\alpha_1} - d_\alpha)\} \tag{25.10}$$

does so. By (25.7), the first term in the exponent on the right of (25.10) tends to
zero. If we put

$$d_{\alpha_1} = d_\alpha + n^{-\delta}, \qquad 0 < \delta < \tfrac12, \tag{25.11}$$

(25.7) is satisfied and (25.10) tends to infinity with n. Thus, although both tests are
asymptotically UMP, the ratio of Type II errors (25.8) tends to infinity with n. It is clear,
therefore, that the criterion of being asymptotically UMP is not a very selective one.
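
The behaviour described in Example 25.1 can be seen numerically. The sketch below is only an illustration under stated assumptions: it takes the choice (25.11) with an exponent of 0.25, a size of 0.05 and a fixed alternative at distance 0.5 from the value tested (all three values are ours), and evaluates the Type II errors of the two tests from (25.1) and (25.2) together with the bound on the left of (25.6). The normal d.f. is computed through `math.erfc` so that very small tail areas are not lost to rounding.

```python
from math import erfc, sqrt
from statistics import NormalDist


def G(x):
    """Standardized normal d.f., written with erfc to keep far-tail accuracy."""
    return 0.5 * erfc(-x / sqrt(2.0))


def type2_errors(n, shift=0.5, alpha=0.05, delta=0.25):
    """Return (1 - P1, 1 - P2, bound) for sample size n.

    shift is Delta = theta_1 - theta_0; delta is the exponent in (25.11).
    """
    std = NormalDist()
    d_alpha = -std.inv_cdf(alpha)
    d_a1 = d_alpha + n ** (-delta)           # choice (25.11)
    alpha1 = G(-d_a1)                        # upper-tail size of the two-tailed test
    alpha2 = alpha - alpha1                  # lower-tail size, so the total is alpha
    d_a2 = -std.inv_cdf(alpha2)
    root_n = sqrt(n)
    beta1 = G(d_alpha - shift * root_n)                              # 1 - P1, from (25.1)
    beta2 = G(d_a1 - shift * root_n) - G(-d_a2 - shift * root_n)     # 1 - P2, from (25.2)
    half = 0.5 * (d_a1 - d_alpha)
    bound = G(half) - G(-half)                                       # left side of (25.6)
    return beta1, beta2, bound


if __name__ == "__main__":
    for n in (16, 100, 400, 1600):
        b1, b2, bound = type2_errors(n)
        print(n, "P1 - P2 =", b2 - b1, "bound =", bound, "ratio =", b2 / b1)
```

With these settings the bound, and with it the power difference, shrinks as n grows, while the ratio of Type II errors keeps increasing. Much larger n at this alternative would drive both errors below floating-point range.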


Asymptotic relative efficiency 

25.5 In order to obtain a useful asymptotic measure of test efficiency from the 
relative efficiency, we consider the limiting relative efficiency of tests against a sequence 
of alternative hypotheses in which $\theta$ approaches the value tested, $\theta_0$, as n increases.
This type of alternative was first investigated by Pitman (1948), whose work was generalized
by Noether (1955). Other types of limiting process on relative efficiency are
considered by Dixon (1953b) and Hodges and Lehmann (1956). 

Let $t_1$ and $t_2$ be consistent test statistics for the hypothesis $H_0: \theta = \theta_0$ against the
one-sided alternative $H_1: \theta > \theta_0$. We assume for the moment that $t_1$ and $t_2$ are
asymptotically normally distributed whatever the value of $\theta$; we shall relax this restriction
in 25.14-15. For brevity, we shall write

$$E(t_i \mid H_j) = E_{ij}, \qquad \mathrm{var}(t_i \mid H_j) = D_{ij}^2,$$
$$E_i^{(r)}(\theta_j) = \frac{\partial^r}{\partial\theta^r} E_{ij}, \qquad D_{ij}^{(r)} = \frac{\partial^r}{\partial\theta^r} D_{ij}, \qquad i = 1, 2;\ j = 0, 1.$$

Large-sample size-$\alpha$ tests are defined by the critical regions

$$t_i > E_{i0} + \lambda_\alpha D_{i0} \tag{25.12}$$

(the sign of $t_i$ being changed if necessary to make the region of this form), where $\lambda_\alpha$
is the normal deviate defined by $G\{-\lambda_\alpha\} = \alpha$, G being the standardized normal d.f.
as before. Just as in Example 22.3, the asymptotic power function of $t_i$ is

$$P_i(\theta) = G\{[E_{i1} - (E_{i0} + \lambda_\alpha D_{i0})]/D_{i1}\}. \tag{25.13}$$

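Writing $u_i(\theta, \lambda_\alpha)$ for the argument of G in (25.13), we expand $(E_{i1} - E_{i0})$ in a Taylor
series, obtaining

$$u_i(\theta, \lambda_\alpha) = \left[E_i^{(m_i)}(\theta_i^*)\, \frac{(\theta - \theta_0)^{m_i}}{m_i!} - \lambda_\alpha D_{i0}\right] \Big/ D_{i1}, \tag{25.14}$$

where $\theta_0 < \theta_i^* < \theta$ and $m_i$ is the order of the first non-zero derivative at $\theta_0$, i.e., $m_i$ is defined by

$$E_i^{(r)}(\theta_0) = 0, \quad r = 1, 2, \dots, m_i - 1; \qquad E_i^{(m_i)}(\theta_0) \ne 0. \tag{25.15}$$

In order to define the alternative hypothesis, we assume that, as $n \to \infty$,

$$E_i^{(m_i)}(\theta_0)/D_{i0} \sim c_i\, n^{m_i \delta_i}. \tag{25.16}$$

(25.16) defines the constants $\delta_i > 0$ and $c_i$. Now consider the sequences of alternatives,
approaching $\theta_0$ as $n \to \infty$,

$$\theta = \theta_0 + \frac{k_i}{n^{\delta_i}}, \tag{25.17}$$

where $k_i$ is an arbitrary positive constant. If the regularity conditions

$$\lim_{n\to\infty} \frac{E_i^{(m_i)}(\theta_i^*)}{E_i^{(m_i)}(\theta_0)} = 1, \qquad \lim_{n\to\infty} \frac{D_{i1}}{D_{i0}} = 1 \tag{25.18}$$

are satisfied, (25.16), (25.17) and (25.18) reduce (25.14) to

$$u_i = \frac{c_i k_i^{m_i}}{m_i!} - \lambda_\alpha \tag{25.19}$$

and the asymptotic powers of the tests are $G\{u_i\}$ from (25.13).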


25.6 If the two tests are to have equal power against the same sequence of alternatives
for any fixed $\alpha$, we must have, from (25.17) and (25.19),

$$\frac{k_1}{n_1^{\delta_1}} = \frac{k_2}{n_2^{\delta_2}} \tag{25.20}$$

and

$$\frac{c_1 k_1^{m_1}}{m_1!} = \frac{c_2 k_2^{m_2}}{m_2!}, \tag{25.21}$$

where $n_1$ and $n_2$ are the sample sizes upon which $t_1$ and $t_2$ are based. We combine
(25.20) and (25.21) into

$$\frac{n_1^{\delta_1}}{n_2^{\delta_2}} = \frac{k_1}{k_2} = \left(\frac{c_2\, m_1!}{c_1\, m_2!}\, k_2^{m_2 - m_1}\right)^{1/m_1}. \tag{25.22}$$

The right-hand side of (25.22) is a positive constant. Thus if we let $n_1, n_2 \to \infty$, the
ratio $n_1/n_2$ will tend to a constant if and only if $\delta_1 = \delta_2$. If $\delta_1 > \delta_2$, we must have
$n_1/n_2 \to 0$, while if $\delta_1 < \delta_2$ we have $n_1/n_2 \to \infty$. If we define the asymptotic relative
efficiency (ARE) of $t_2$ compared to $t_1$ as

$$A_{21} = \lim \frac{n_1}{n_2}, \tag{25.23}$$

we therefore have the result

$$A_{21} = 0, \qquad \delta_1 > \delta_2. \tag{25.24}$$

Thus to compare two tests by the criterion of ARE, we first compare their values of $\delta$:
if one has a smaller $\delta$ than the other, it has ARE of zero compared to the other. The
value of $\delta$ plays the same role here as the order of magnitude of the variance plays in
measuring efficiency of estimation (cf. 17.29).
We may now confine ourselves to the case $\delta_1 = \delta_2 = \delta$. (25.22) and (25.23) then
give

$$A_{21} = \lim \frac{n_1}{n_2} = \left(\frac{c_2\, m_1!}{c_1\, m_2!}\, k_2^{m_2 - m_1}\right)^{1/(m_1 \delta)}. \tag{25.25}$$

If, in addition,

$$m_1 = m_2 = m, \tag{25.26}$$

(25.25) reduces to

$$A_{21} = \left(\frac{c_2}{c_1}\right)^{1/(m\delta)},$$

which on using (25.16) becomes

$$A_{21} = \lim_{n\to\infty} \left\{\frac{E_2^{(m)}(\theta_0)/D_{20}}{E_1^{(m)}(\theta_0)/D_{10}}\right\}^{1/(m\delta)}. \tag{25.27}$$

(25.27) is simple to evaluate in most cases, and we shall be using it extensively in later
chapters to evaluate the ARE of particular tests. Most commonly, $\delta = \tfrac12$ (corresponding
to an estimation variance of order $n^{-1}$) and m = 1. For an interpretation of
the value of m, see 25.10 below.

In passing, we may note that if $m_1 \ne m_2$, (25.25) is indeterminate, depending as
it does on the arbitrary constant $k_2$. We therefore see that tests with equal values of $\delta$
do not have the same ARE against all sequences of alternatives (25.17) unless they also
have equal values of m. We shall be commenting on the reasons for this in 25.10.


25.7 If we wish to test $H_0$ against the two-sided $H_1: \theta \ne \theta_0$, our results for the
ARE are unaffected if we use "equal-tails" critical regions of the form

$$t_i > E_{i0} + \lambda_{\frac12\alpha} D_{i0} \quad \text{or} \quad t_i < E_{i0} - \lambda_{\frac12\alpha} D_{i0},$$

for the asymptotic power functions (25.13) are replaced by

$$Q_i(\theta) = G\{u_i(\theta, \lambda_{\frac12\alpha})\} + 1 - G\{u_i(\theta, -\lambda_{\frac12\alpha})\}, \tag{25.28}$$

and $Q_1 = Q_2$ against the alternative (25.17) (where $k_i$ need no longer be positive) if
(25.20) and (25.21) hold, as before. Konijn (1956) gives a more general treatment
of two-sided tests, which need not necessarily be "equal-tails" tests.


Example 25.2 

Let us compare the sample median $\tilde{x}$ with the UMP sample mean $\bar{x}$ in testing
the mean $\theta$ of a normal distribution with known variance $\sigma^2$. Both statistics are asymptotically
normally distributed. We know that

$$E(\bar{x}) = \theta, \qquad D^2(\bar{x} \mid \theta) = \sigma^2/n$$

and $\tilde{x}$ is a consistent estimator of $\theta$, with

$$E(\tilde{x}) = \theta, \qquad D^2(\tilde{x} \mid \theta) \sim \pi\sigma^2/(2n)$$

(cf. Example 10.7). Thus we have

$$E'(\theta_0) = 1$$

for both tests, so that $m_1 = m_2 = 1$, while from (25.16), $\delta_1 = \delta_2 = \tfrac12$. Thus, from
(25.27),

$$A_{\tilde{x},\bar{x}} = \lim_{n\to\infty} \left\{\frac{1/(\pi\sigma^2/2n)^{1/2}}{1/(\sigma^2/n)^{1/2}}\right\}^{2} = \frac{2}{\pi}.$$

This is precisely the result we obtained in Example 17.12 for the efficiency of $\tilde{x}$ in
estimating $\theta$. We shall see in 25.13 that this is a special case of a general relationship
between estimating efficiency and ARE for tests.
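
The value 2/$\pi$ in Example 25.2 rests on the large-sample variance of the median. A small simulation, offered only as a sketch, gives a feel for how close finite samples come to it. The sample size, the number of replications and the seed below are arbitrary choices of ours, and the function name is hypothetical.

```python
import random
from math import pi
from statistics import median, pvariance


def simulated_efficiency(n=101, reps=4000, seed=1):
    """Monte Carlo estimate of var(sample mean) / var(sample median)
    for standard normal samples of size n."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(sum(sample) / n)
        medians.append(median(sample))
    return pvariance(means) / pvariance(medians)


if __name__ == "__main__":
    print("simulated ratio:", simulated_efficiency())
    print("limiting value 2/pi:", 2 / pi)
```

The simulated ratio carries Monte Carlo error and a small finite-sample bias, so it should be read as being near 2/$\pi$ rather than equal to it.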


ARE and the derivatives of the power functions 

25.8 The nature of the sequence of alternative hypotheses (25.17), which approaches
$\theta_0$ as $n \to \infty$, makes it clear that the ARE is in some way related to the behaviour,
near $\theta_0$, of the power functions of the tests being compared. We shall make this
tionship more precise by showing that, under certain conditions, the ARE is a simple 
function of the ratio of derivatives of the power functions. 

We first treat the case of the one-sided $H_1$ discussed in 25.5-6, where the power
functions of the tests are asymptotically given by (25.13), which we write, as before,

$$P_i(\theta) = G\{u_i(\theta, \lambda_\alpha)\}. \tag{25.29}$$

Differentiating with respect to $\theta$, we have

$$P_i'(\theta) = g\{u_i\}\, u_i'(\theta, \lambda_\alpha), \tag{25.30}$$

where g is the normal frequency function. From (25.13) we find

$$u_i'(\theta, \lambda_\alpha) = \frac{E_i'(\theta)}{D_{i1}} - (E_{i1} - E_{i0} - \lambda_\alpha D_{i0})\, \frac{D_{i1}'}{D_{i1}^2}. \tag{25.31}$$

As $n \to \infty$, we find, using (25.18), and the further regularity conditions

$$\lim_{n\to\infty} \frac{D_{i1}'}{D_{i0}'} = 1, \qquad \lim_{n\to\infty} \frac{E_{i1}}{E_{i0}} = 1,$$

that (25.31) becomes

$$u_i'(\theta, \lambda_\alpha) = \frac{E_i'(\theta_0)}{D_{i0}} + \lambda_\alpha \frac{D_{i0}'}{D_{i0}}, \tag{25.32}$$

so that if $m_i = 1$ in (25.15) and if

$$\lim_{n\to\infty} \frac{D_{i0}'}{E_i'(\theta_0)} = 0, \tag{25.33}$$

(25.32) reduces at $\theta_0$ to

$$u_i'(\theta_0, \lambda_\alpha) \sim E_i'(\theta_0)/D_{i0}. \tag{25.34}$$

Since, from (25.13),

$$g\{u_i(\theta_0, \lambda_\alpha)\} = g\{-\lambda_\alpha\}, \tag{25.35}$$

(25.30) becomes, on substituting (25.34) and (25.35),

$$P_i'(\theta_0) = P_i'(\theta_0, \lambda_\alpha) \sim g\{-\lambda_\alpha\}\, E_i'(\theta_0)/D_{i0}. \tag{25.36}$$

Remembering that $m_i = 1$, we therefore have from (25.36) and (25.27)

$$\lim_{n\to\infty} \frac{P_2'(\theta_0)}{P_1'(\theta_0)} = A_{21}^{\delta}, \tag{25.37}$$

so that the asymptotic ratio of the first derivatives of the power functions of the tests
at $\theta_0$ is simply the ARE raised to the power $\delta$ (commonly $\tfrac12$). Thus if we were to use
this ratio as a criterion of asymptotic efficiency of tests, we should get precisely the same
results as by using the ARE. This criterion was, in fact, proposed (under the name
"asymptotic local efficiency") by Blomqvist (1950).
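
The relation between slopes and ARE can be checked numerically for the mean and median tests of Example 25.2. The following sketch assumes, as an idealization of ours, that each statistic is exactly normal with mean $\theta$ and variance v/n (v = 1 for the mean, v = $\pi$/2 for the median), evaluates the one-sided power function (25.13), and estimates its slope at the value tested by a central difference. The sample size, step length and function names are illustrative.

```python
from math import pi, sqrt
from statistics import NormalDist

STD = NormalDist()


def power(theta, n, v, alpha=0.05, theta0=0.0):
    """Asymptotic one-sided power (25.13) for a statistic that is
    normal with mean theta and variance v / n (an assumed form)."""
    lam = STD.inv_cdf(1.0 - alpha)
    return STD.cdf((theta - theta0) * sqrt(n / v) - lam)


def slope_at_null(n, v, h=1e-6):
    """Central-difference estimate of the derivative of the power at theta0 = 0."""
    return (power(h, n, v) - power(-h, n, v)) / (2.0 * h)


n = 200
ratio = slope_at_null(n, pi / 2) / slope_at_null(n, 1.0)

if __name__ == "__main__":
    print("ratio of slopes:", ratio)
    print("ARE ** delta   :", (2 / pi) ** 0.5)
```

Under these assumptions the slope ratio should agree with the square root of 2/$\pi$ up to the error of the finite difference, in line with the case $\delta = \tfrac12$, m = 1.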


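25.9 If $m_i > 1$, i.e. $E_i'(\theta_0) = 0$, (25.36) is zero to our order of approximation and
the result of 25.8 is of no use. The differentiation process has to be taken further
to yield useful results.
From (25.30), we obtain

$$P_i''(\theta) = \frac{\partial g\{u_i\}}{\partial u_i}\, \{u_i'(\theta, \lambda_\alpha)\}^2 + g\{u_i\}\, u_i''(\theta, \lambda_\alpha). \tag{25.38}$$

From (25.31),

$$u_i''(\theta, \lambda_\alpha) = \frac{E_i''(\theta)}{D_{i1}} - \frac{2 E_i'(\theta) D_{i1}'}{D_{i1}^2} - (E_{i1} - E_{i0} - \lambda_\alpha D_{i0}) \left\{\frac{D_{i1}''}{D_{i1}^2} - \frac{2 D_{i1}'^2}{D_{i1}^3}\right\}. \tag{25.39}$$

If (25.18) holds with $m_i = 2$ and also the regularity conditions below (25.31) and

$$\lim_{n\to\infty} \frac{D_{i1}''}{D_{i0}''} = 1, \tag{25.40}$$

(25.39) gives

$$u_i''(\theta_0, \lambda_\alpha) = \frac{E_i''(\theta_0)}{D_{i0}} + \lambda_\alpha \left\{\frac{D_{i0}''}{D_{i0}} - 2\left(\frac{D_{i0}'}{D_{i0}}\right)^2\right\}. \tag{25.41}$$

Instead of (25.33), we now assume the conditions

$$\lim_{n\to\infty} \frac{D_{i0}''}{E_i''(\theta_0)} = 0, \qquad \lim_{n\to\infty} \frac{D_{i0}'^2}{D_{i0}\, E_i''(\theta_0)} = 0. \tag{25.42}$$

(25.42) reduces (25.41) to

$$u_i''(\theta_0, \lambda_\alpha) \sim E_i''(\theta_0)/D_{i0}. \tag{25.43}$$

Returning now to (25.38), we see that since

$$\frac{\partial g\{u_i\}}{\partial u_i} = -u_i\, g\{u_i\},$$

we have, using (25.32), (25.35) and (25.43) in (25.38),

$$P_i''(\theta_0) \sim g\{-\lambda_\alpha\} \left[\lambda_\alpha \left\{\frac{E_i'(\theta_0)}{D_{i0}} + \lambda_\alpha \frac{D_{i0}'}{D_{i0}}\right\}^2 + \frac{E_i''(\theta_0)}{D_{i0}}\right]. \tag{25.44}$$

Since we are considering the case $m_i = 2$ here, the term in $E_i'(\theta_0)$ is zero, and from the
second condition of (25.42), (25.44) may finally be written

$$P_i''(\theta_0) \sim g\{-\lambda_\alpha\}\, E_i''(\theta_0)/D_{i0}, \tag{25.45}$$

whence, with m = 2, (25.27) gives

$$\lim_{n\to\infty} \frac{P_2''(\theta_0)}{P_1''(\theta_0)} = A_{21}^{2\delta} \tag{25.46}$$

for the limiting ratio of the second derivatives.

(25.37) and (25.46) may be expressed concisely by the statement that for m = 1, 2,
the ratio of the mth derivatives of the power functions of one-sided tests is asymptotically
equal to the ARE raised to the power $m\delta$.

If, instead of (25.33) and (25.42), we had imposed the stronger conditions

$$\lim_{n\to\infty} D_{i0}'/D_{i0} = 0, \qquad \lim_{n\to\infty} D_{i0}''/D_{i0} = 0, \tag{25.47}$$

which with (25.16) imply (25.33) and (25.42), (25.34) and (25.43) would have followed
from (25.32) and (25.41) as before. (25.47) may be easier to verify in particular cases.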


The interpretation of the value of m 

25.10 We now discuss the general conditions under which m will take the value
1 or 2. Consider again the asymptotic power function (25.13) for a one-sided alternative
$H_1: \theta > \theta_0$ and a one-tailed test (25.12). For brevity, we drop the suffix "i"
in this section. If $\theta \to \theta_0$, and $D_1 \to D_0$, by (25.18), it becomes

$$P(\theta) = G\left\{\frac{E_1 - E_0}{D_0} - \lambda_\alpha\right\},$$

a monotone increasing function of $(E_1 - E_0)$.
If $(E_1 - E_0)$ is an increasing function of $(\theta - \theta_0)$, $P(\theta) \to 0$ as $\theta \to -\infty$ (which
implies that the other "tail" of the distribution of the test statistic would be used as
a critical region if $\theta < \theta_0$). If $E'(\theta_0)$ exists, it is non-zero and m = 1, and $P'(\theta_0) \ne 0$
also, by (25.36).

If, on the other hand, $(E_1 - E_0)$ is an even function of $(\theta - \theta_0)$ (which implies that
the same "tail" would be used as critical region whatever the sign of $(\theta - \theta_0)$), and
an increasing function of $|\theta - \theta_0|$, and $E'(\theta_0)$ exists, it must under regularity conditions
equal zero, and m > 1; in practice, we find m = 2. By (25.36), $P'(\theta_0) = 0$ also to
this order of approximation.

We are now in a position to see why, as remarked at the end of 25.6, the ARE is 
not useful in comparing tests with differing values of m, which in practice are 1 and 2. 
For we are then comparing tests whose power functions behave essentially differently 
at $\theta_0$, one having a regular minimum there and the other not. The indeterminacy of
(25.25) in such circumstances is not really surprising. It should be added that this 
indeterminacy is, at the time of writing, of purely theoretical interest, since no case 
seems to be known to which it applies. 


Example 25.3 


Consider the problem of testing $H_0: \theta = \theta_0$ for a normal distribution with mean $\theta$
and variance 1. The pair of one-tailed tests based on the sample mean $\bar{x}$ are UMP
(cf. 22.17), the upper or lower tail being selected according to whether $H_1$ is $\theta > \theta_0$
or $\theta < \theta_0$. From Example 25.2, $\delta = \tfrac12$ and m = 1 for $\bar{x}$.

We could also use as a test statistic

$$S = \sum_{i=1}^{n} (x_i - \theta_0)^2.$$

S has a non-central chi-squared distribution with n degrees of freedom and non-central
parameter $n(\theta - \theta_0)^2$, so that (cf. Exercise 24.1)

$$E(S \mid \theta) = n\{1 + (\theta - \theta_0)^2\},$$
$$D^2(S \mid \theta_0) = 2n,$$

and as $n \to \infty$, S is asymptotically normally distributed. We have $E'(\theta) = 2n(\theta - \theta_0)$,
$E'(\theta_0) = 0$, $E''(\theta) = 2n = E''(\theta_0)$, so that m = 2 and

$$\frac{E''(\theta_0)}{D_0} = \frac{2n}{(2n)^{1/2}} = (2n)^{1/2}.$$

From (25.16), since m = 2, $\delta = \tfrac14$. Since $\delta = \tfrac12$ for $\bar{x}$, the ARE of S compared to $\bar{x}$
is zero by (25.24). The critical region for S consists of the upper tail, whatever the
value of $\theta$.
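
A zero ARE can be made concrete by evaluating both asymptotic power functions along the sequence of alternatives at distance k divided by the square root of n from the value tested, which is the sequence appropriate to the sample mean. The sketch below is an illustration under stated assumptions: it uses the normal approximation (25.13) for S, and for the variance of S under the alternative it uses the standard non-central chi-squared variance 2(n + 2 x non-centrality), which is not quoted in the example above. The value k = 3 and the function names are ours.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()


def power_mean(k, alpha=0.05):
    """Asymptotic power of the one-sided mean test at theta0 + k / sqrt(n).
    It does not depend on n."""
    lam = STD.inv_cdf(1.0 - alpha)
    return STD.cdf(k - lam)


def power_S(k, n, alpha=0.05):
    """Normal approximation (25.13) to the power of S at theta0 + k / sqrt(n)."""
    lam = STD.inv_cdf(1.0 - alpha)
    noncentrality = k * k                    # n * (theta - theta0) ** 2
    e0, d0 = n, sqrt(2.0 * n)                # mean and s.d. of S under H0
    e1 = n + noncentrality                   # mean of S under the alternative
    d1 = sqrt(2.0 * n + 4.0 * noncentrality) # assumed non-central chi-squared s.d.
    return STD.cdf((e1 - e0 - lam * d0) / d1)


if __name__ == "__main__":
    print("mean test:", power_mean(3.0))
    for n in (10**2, 10**4, 10**6, 10**8):
        print(n, "S test:", power_S(3.0, n))
```

Along this sequence the power of the mean test stays fixed while that of S sinks towards the size of the test, which is what an ARE of zero amounts to.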


25.11 We now turn to the case of the two-sided alternative $H_1: \theta \ne \theta_0$. The
power function of the "equal-tails" test is given asymptotically by (25.28). Its
derivative at $\theta_0$ is

$$Q_i'(\theta_0) = P_i'(\theta_0, \lambda_{\frac12\alpha}) - P_i'(\theta_0, -\lambda_{\frac12\alpha}), \tag{25.48}$$

where $P_i'$ is given by (25.36) if $m_i = 1$ and (25.33) or (25.47) holds. Since $g\{-\lambda_\alpha\}$ in
(25.36) is an even function of $\lambda_\alpha$, (25.48) immediately gives the asymptotic result

$$Q_i'(\theta_0) \sim 0,$$

so that the slope of the power function at $\theta_0$ is asymptotically zero. This result is also
implied (under regularity conditions) by the remark in 24.17 concerning the asymptotic
unbiassedness of consistent tests.
The second derivative of the power function (25.28) is

$$Q_i''(\theta_0) = P_i''(\theta_0, \lambda_{\frac12\alpha}) - P_i''(\theta_0, -\lambda_{\frac12\alpha}). \tag{25.49}$$

We have evaluated $P_i''$ at (25.44) where we had $m_i = 2$. (25.44) still holds for $m_i = 1$
if we strengthen the first condition in (25.47) to

$$D_{i0}'/D_{i0} = o(n^{-\delta}), \tag{25.50}$$

for then by (25.16) the second term on the right of (25.39) may be neglected and we
obtain (25.44) as before. Substituted into (25.49), it gives

$$Q_i''(\theta_0) \sim 2\lambda_{\frac12\alpha}\, g\{\lambda_{\frac12\alpha}\} \left[\left(\frac{E_i'(\theta_0)}{D_{i0}}\right)^2 + \left(\lambda_{\frac12\alpha} \frac{D_{i0}'}{D_{i0}}\right)^2\right],$$

and (25.50) reduces this to

$$Q_i''(\theta_0) \sim 2\lambda_{\frac12\alpha}\, g\{\lambda_{\frac12\alpha}\} \left(\frac{E_i'(\theta_0)}{D_{i0}}\right)^2. \tag{25.51}$$

In this case, therefore, (25.27) and (25.51) give

$$\lim_{n\to\infty} \frac{Q_2''(\theta_0)}{Q_1''(\theta_0)} = A_{21}^{2\delta}. \tag{25.52}$$

Thus for m = 1, the asymptotic ratio of second derivatives of the power functions of
two-sided tests is exactly that given by (25.46) for one-sided tests when m = 2, and
exactly the square of the one-sided test result for m = 1 at (25.37).
The case m = 2 does not seem of much importance for two-tailed tests: the
remarks in 25.10 suggest that where m = 2 a one-tailed test would often be used even
against a two-sided $H_1$.


Example 25.4 

Reverting to Example 25.2, we saw that both tests have $\delta = \tfrac12$, m = 1 and $E'(\theta_0) = 1$.
Since the variance of each statistic is independent of $\theta$, at least asymptotically, we see
that (25.33) and (25.50) are satisfied and, the regularity conditions being satisfied, it
follows from (25.37) that for one-sided tests

$$\lim_{n\to\infty} \frac{P_{\tilde{x}}'(\theta_0)}{P_{\bar{x}}'(\theta_0)} = A_{\tilde{x},\bar{x}}^{1/2} = \left(\frac{2}{\pi}\right)^{1/2},$$

while for two-sided tests, from (25.52),

$$\lim_{n\to\infty} \frac{Q_{\tilde{x}}''(\theta_0)}{Q_{\bar{x}}''(\theta_0)} = A_{\tilde{x},\bar{x}} = \frac{2}{\pi}.$$


The maximum power loss and the ARE

25.12 Although the ARE of tests essentially reflects their power properties in the
neighbourhood of $\theta_0$, it does have some implications for the asymptotic power function
as a whole, at least for the case m = 1, to which we now confine ourselves.
The power function $P_i(\theta)$ of a one-sided test is $G\{u_i(\theta)\}$, where $u_i(\theta)$, given at
(25.14), is asymptotically equal, under regularity conditions (25.18), to

$$u_i(\theta) = \frac{E_i'(\theta_0)}{D_{i0}}\,(\theta - \theta_0) - \lambda_\alpha \tag{25.53}$$

when $m_i = 1$. Thus $u_i(\theta)$ is asymptotically linear in $\theta$. If we write $R_i = E_i'(\theta_0)/D_{i0}$
as at (25.16), we may write the difference between two such power functions as

$$d(\theta) = P_1(\theta) - P_2(\theta) = G\{(\theta - \theta_0) R_1 - \lambda_\alpha\} - G\left\{(\theta - \theta_0) R_1 \frac{R_2}{R_1} - \lambda_\alpha\right\}, \tag{25.54}$$

where we assume $R_1 > R_2$ without loss of generality. Consider the behaviour of
$d(\theta)$ as a function of $\theta$. When $\theta = \theta_0$, d = 0, and again as $\theta$ tends to infinity $P_1$ and $P_2$
both tend to 1 and d to zero. The maximum value of $d(\theta)$ depends only on the ratio
$R_2/R_1$, for although $R_1$ appears in the right-hand side of (25.54) it is always the coefficient
of $(\theta - \theta_0)$, which is being varied from 0 to $\infty$, so that $R_1(\theta - \theta_0)$ also goes from
0 to $\infty$ whatever the value of $R_1$. We therefore write $\Delta = R_1(\theta - \theta_0)$ in (25.54),
obtaining

$$d(\Delta) = G\{\Delta - \lambda_\alpha\} - G\left\{\Delta \frac{R_2}{R_1} - \lambda_\alpha\right\}. \tag{25.55}$$

The first derivative of (25.55) with respect to $\Delta$ is

$$d'(\Delta) = g\{\Delta - \lambda_\alpha\} - \frac{R_2}{R_1}\, g\left\{\frac{R_2}{R_1}\Delta - \lambda_\alpha\right\},$$

and if this is equated to zero, we have

$$\frac{R_2}{R_1} = \exp\left\{-\tfrac12(\Delta - \lambda_\alpha)^2 + \tfrac12\left(\frac{R_2}{R_1}\Delta - \lambda_\alpha\right)^2\right\} = \exp\left\{-\tfrac12\Delta^2\left(1 - \frac{R_2^2}{R_1^2}\right) + \lambda_\alpha \Delta\left(1 - \frac{R_2}{R_1}\right)\right\}. \tag{25.56}$$

(25.56) is a quadratic equation in $\Delta$, whose only positive root is

$$\Delta = \frac{\lambda_\alpha + \left\{\lambda_\alpha^2 + 2\,\dfrac{R_1 + R_2}{R_1 - R_2}\,\log\dfrac{R_1}{R_2}\right\}^{1/2}}{1 + \dfrac{R_2}{R_1}}. \tag{25.57}$$

This is the value at which (25.55) is maximized. Consider, for example, the case
$\alpha$ = 0.05 ($\lambda_\alpha$ = 1.645) and $R_2/R_1$ = 0.5. (25.57) gives

$$\Delta = \frac{1.645 + \{1.645^2 + 6\log_e 2\}^{1/2}}{1.5} = 2.85.$$

(25.55) then gives, using tables of the normal d.f.,

$$P_1 = G\{2.85 - 1.64\} = G\{1.21\} = 0.89,$$
$$P_2 = G\{1.42 - 1.64\} = G\{-0.22\} = 0.41.$$
D. R. Cox and Stuart (1955) gave values of $P_1$ and $P_2$ at the point of maximum difference,
obtained by the graphical equivalent of the above method, for a range of values of
$\alpha$ and $R_2/R_1$. Their table is reproduced below, our worked example above being one
of the entries.


Asymptotic powers of tests at the point of greatest difference
(D. R. Cox and Stuart, 1955)

            $\alpha$ = 0.10    $\alpha$ = 0.05    $\alpha$ = 0.01    $\alpha$ = 0.001
$R_2/R_1$    $P_2$   $P_1$     $P_2$   $P_1$     $P_2$   $P_1$     $P_2$   $P_1$
   0.9        67      73        63      71        49      60        54      67
   0.8        61      74        56      72        49      71        43      42
   0.7        59      80        51      77        42      77        39      83
   0.6        54      84        47      84        39      86        29      87
   0.5        48      88        41      89        30      90        20      93
   0.3        35      96        27      96        14      97         7      99

(Decimal points are omitted.)


It will be seen from the table that as $\alpha$ decreases for fixed $R_2/R_1$, the maximum difference
between the asymptotic power functions increases steadily; it can, in fact, be
made as near to 1 as desired by taking $\alpha$ small enough. Similarly, for fixed $\alpha$, the
maximum difference increases steadily as $R_2/R_1$ falls.

The practical consequence of the table is that if $R_2/R_1$ is 0.9 or more, the loss of
power along the whole course of the asymptotic power function will not exceed 0.08
for $\alpha$ = 0.05, 0.11 for $\alpha$ = 0.01, and 0.13 for $\alpha$ = 0.001, the most commonly used
test sizes. Since $R_2/R_1$ is, from (25.36), the ratio of first derivatives of the power
functions, we have from (25.37) that $(R_2/R_1)^{1/\delta} = A_{21}$, where $\delta$ is commonly $\tfrac12$, and
thus the ARE needs to be $(0.9)^2$ for the statements above to be true.
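
The maximization in 25.12 is easy to carry out directly. The sketch below solves the quadratic implied by (25.56) for its positive root and evaluates the two asymptotic powers there; the closed form used for the root is our own rearrangement, written in terms of the ratio $R_2/R_1$, and the function name is hypothetical. For a ratio of 0.5 and a size of 0.05 it should land close to the worked figures quoted above (2.85, 0.89 and 0.41), any small differences coming from the rounding in the hand calculation.

```python
from math import log, sqrt
from statistics import NormalDist

STD = NormalDist()


def max_power_gap(ratio, alpha):
    """Return (Delta, P1, P2) at the maximum of d(Delta) in (25.55).

    ratio is R2 / R1 and must lie strictly between 0 and 1.
    """
    lam = STD.inv_cdf(1.0 - alpha)
    # Positive root of Delta^2 (1 - r^2) - 2 lam Delta (1 - r) + 2 log r = 0.
    disc = lam ** 2 + 2.0 * (1.0 + ratio) / (1.0 - ratio) * log(1.0 / ratio)
    big_delta = (lam + sqrt(disc)) / (1.0 + ratio)
    p1 = STD.cdf(big_delta - lam)            # power of the better test
    p2 = STD.cdf(ratio * big_delta - lam)    # power of the other test
    return big_delta, p1, p2


if __name__ == "__main__":
    for alpha in (0.10, 0.05, 0.01, 0.001):
        d, p1, p2 = max_power_gap(0.9, alpha)
        print(alpha, round(p2, 2), round(p1, 2), "loss", round(p1 - p2, 2))
```

Since the table was obtained graphically and the difference is flat near its maximum, individual computed powers need not match the tabulated pairs closely, although the computed power losses can be compared with the differences in the table.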


ARE and estimating efficiency 


25.13 There is a simple connexion between the ARE and estimating efficiency.
If we have two consistent test statistics $t_i$ as before, we define functions $f_i$, independent
of n, such that the statistics

$$T_i = f_i(t_i) \tag{25.58}$$

are consistent estimators of $\theta$. If we write

$$\theta = f_i(\tau_i), \tag{25.59}$$

it follows from (25.58) that since $T_i \to \theta$ in probability, $t_i \to \tau_i$ and $E(t_i)$ if it exists
also tends to $\tau_i$. Expanding (25.58) by Taylor's theorem about $\tau_i$, we have, using
(25.59),

$$T_i = \theta + (t_i - \tau_i) \left[\frac{\partial f_i(t_i)}{\partial t_i}\right]_{t_i = t_i^*}, \tag{25.60}$$

where $t_i^*$, intermediate in value between $t_i$ and $\tau_i$, tends to $\tau_i$ as n increases. Thus
(25.60) may be written

$$T_i - \theta \sim (t_i - \tau_i) \left[\frac{\partial\theta}{\partial E(t_i)}\right],$$

whence

$$\mathrm{var}\, T_i \sim \mathrm{var}\, t_i \Big/ \left(\frac{\partial E(t_i)}{\partial\theta}\right)^2. \tag{25.61}$$

If $2\delta$ is the order of magnitude in n of the variances of the $T_i$, the estimating efficiency
of $T_2$ compared to $T_1$ is, by (17.65) and (25.61),

$$\lim_{n\to\infty} \left(\frac{\mathrm{var}\, T_1}{\mathrm{var}\, T_2}\right)^{1/(2\delta)} = \lim_{n\to\infty} \left[\frac{\{\partial E(t_2)/\partial\theta\}^2/\mathrm{var}\, t_2}{\{\partial E(t_1)/\partial\theta\}^2/\mathrm{var}\, t_1}\right]^{1/(2\delta)}. \tag{25.62}$$

At $\theta_0$, (25.62) is precisely equal to the ARE (25.27) when $m_i = 1$. Thus the ARE
essentially gives the relative estimating efficiencies of transformations of the test statistics
which are consistent estimators of the parameter concerned. But this correspondence
is a local one: in 22.15 we saw that the connexion between estimating efficiency
and power is not strong in general. It follows at once that tests based upon efficient
estimators have maximum ARE and (from 25.8-11) that the derivatives of their power
functions at $\theta_0$ are maximized. (25.62) and (17.61) also imply that if $T_1$ is efficient,
$A_{21} = \{\rho(T_1, T_2)\}^{1/\delta}$. A more general result in terms of $\rho(t_1, t_2)$ is given in Exercise 25.9.


Example 25.5

The result we have just obtained explains the fact, noted in Example 25.2, that 
the ARE of the sample median, compared to the sample mean, in testing the mean of 
a normal distribution has exactly the same value as its estimating efficiency for that 
parameter. 


Non-normal cases 

25.14 From 25.5 onwards, we have confined ourselves to the case of asymptotically
normally distributed test statistics. However, examination of 25.5-7 will show
that in deriving the ARE we made no specific use of the normality assumption. We
were concerned to establish the conditions under which the arguments $u_i$ of the power
functions $G\{u_i\}$ in (25.19) would be equal against the sequence of alternatives (25.17).
G played no role in the discussion other than of ensuring that the asymptotic power
functions were of the same form, and we need only require that G is a regularly behaved
d.f.

It follows that if two tests have asymptotic power functions of any two-parameter
form G, only one of whose parameters is a function of $\theta$, the results of 25.5-7 will hold,
for (25.17) will fix this parameter and $u_i$ in (25.19) then determines the other. Given
the form G, the critical region for one-tailed tests can always be put in the form (25.12),
where $\lambda_\alpha$ is more generally interpreted as the multiple of $D_{i0}$ required to make (25.12)
a size-$\alpha$ critical region.


25.15 The only important limiting distributions other than the normal are the
non-central $\chi^2$ distributions whose properties were discussed in 24.4-5. Suppose that
for the hypothesis $H_0: \theta = \theta_0$, we have two test statistics $t_i$ with such distributions,
the degrees of freedom being $\nu_i$ (independent of $\theta$) and the non-central parameters
$\lambda_i(\theta)$, where $\lambda_i(\theta_0) = 0$, so that the $\chi^2$ distributions are central when $H_0$ holds. We
have (cf. Exercise 24.1)

    $E_i = \nu_i + \lambda_i(\theta), \qquad D_{i0}^2 = 2\nu_i.$    (25.63)

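All the results of 25.5-6 for one-sided tests therefore hold for the comparison of
test statistics distributed in the non-central $\chi^2$ form (central when $H_0$ holds) with
degrees of freedom independent of $\theta$. In particular, when $\delta_1 = \delta_2 = \delta$ and $m_1 = m_2 = m$,
(25.63) substituted into (25.27) gives

    $A_{21} = \lim \left\{\dfrac{\lambda_2^{(m)}(\theta_0)}{\lambda_1^{(m)}(\theta_0)}\right\}^{1/(m\delta)} \left(\dfrac{\nu_1}{\nu_2}\right)^{1/(2m\delta)}.$    (25.64)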

A different derivation of this result is given by E. J. Hannan (1956). 


Other measures of test efficiency 

25.16 Although in later chapters we shall use only the relative efficiency and the 
ARE as measures of test efficiency, we conclude this chapter by discussing two alter- 
native methods which have been proposed. 

Walsh (1946) proposed the comparison of two tests for fixed size $\alpha$ by a measure
which takes into account the performance of the tests for all alternative hypothesis
values of the parameter $\theta$. If the tests $t_i$ are based on sample sizes $n_i$ and have power
functions $P_i(\theta, n_i)$, the efficiency of $t_2$ compared to $t_1$ is $n_1/n_2 = e_{21}$, where

    $\int [P_1(\theta, n_1) - P_2(\theta, n_2)]\,d\theta = 0.$    (25.65)

Thus, given one of the sample sizes (say, $n_2$), we choose $n_1$ so that the algebraic sum of
the areas between the power functions is zero, and measure efficiency by $n_1/n_2$.

This measure removes the effect of $\theta$ from the table of triple entry required to
compare two power functions, and does so in a reasonable way. However, $e_{21}$ is still
a function of $\alpha$ and, more important, of $n_2$. Moreover, the calculation of $n_1/n_2$ so that
(25.65) is satisfied is inevitably tedious and probably accounts for the fact that this
measure has rarely been used. As an asymptotic measure, however, it is equivalent
to the use of the ARE, at least for asymptotically normally distributed test statistics
with $m_i = 1$ in (25.15). For we then have, as in 25.12,

    $P_i(\theta, n_i) = G\{(\theta - \theta_0) R_i - \lambda_{\alpha}\},$

where $R_i = E_i'(\theta_0)/D_{i0}$ as at (25.16), and (25.65) then becomes

    $\int [G\{(\theta - \theta_0) R_1 - \lambda_{\alpha}\} - G\{(\theta - \theta_0) R_2 - \lambda_{\alpha}\}]\,d\theta = 0.$    (25.66)
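Clearly, (25.66) holds asymptotically only when $R_1 = R_2$, or, from (25.16), when
$c_1 n_1^{\delta} = c_2 n_2^{\delta}$, whence

    $\lim \dfrac{n_1}{n_2} = \left(\dfrac{c_2}{c_1}\right)^{1/\delta} = A_{21},$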


exactly as at (25.27) with $m = 1$.


25.17 Finally, we summarize a quite different approach to the problem of measuring
asymptotic efficiency for tests, due to Chernoff (1952). For a variate x with
moment-generating function $M_x(t) = E(e^{xt})$, we define

    $m(a) = \inf_t M_{x-a}(t),$    (25.67)

the absolute minimum value of the m.g.f. of $(x - a)$. If $E(x \mid H_i) = \mu_i$ for simple
hypotheses $H_0$, $H_1$, we further define

    $\rho = \inf_{\mu_0 \le a \le \mu_1} \max\{m_0(a), m_1(a)\},$    (25.68)

where the suffix to m, defined at (25.67), indicates the hypothesis. For a one-sided
test of $H_0$ against $H_1$ based on a sum of n identically distributed $x_i$, with size $\alpha$ and
power $1 - \beta$, Chernoff shows that if any linear function $l(\alpha, \beta)$ of the probabilities of
error $\alpha$ and $\beta$ is minimized, its minimum value behaves as $n \to \infty$ like $\rho^n$, where $\rho$ is
defined by (25.68). Consider two such tests $t_i$, based on samples of size $n_i$. If they
have equal minima for $l(\alpha, \beta)$, we therefore have

    $\rho_1^{n_1} = \rho_2^{n_2}$

or

    $\dfrac{n_1}{n_2} = \dfrac{\log \rho_2}{\log \rho_1}.$    (25.69)

Thus the right-hand side of (25.69) is a measure of the asymptotic efficiency of $t_2$
compared to $t_1$. Its use is restricted to test statistics based on sums of independent
observations, and the computation required may be considerable.
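
The definitions (25.67)-(25.69) can be illustrated numerically. The following is only a sketch under stated assumptions: it takes x to be normal under each hypothesis, for which the minimum of the m.g.f. of $(x - a)$ is available in closed form, and it replaces the infimum over a in (25.68) by a plain grid search. The function names are hypothetical and the grid size is an arbitrary choice. The closed form used for comparison is the one stated in Exercise 25.5.

```python
import math


def min_mgf_normal(a, mu, sigma):
    """m(a) of (25.67) when x ~ N(mu, sigma^2).

    log M_{x-a}(t) = (mu - a) t + sigma^2 t^2 / 2 is smallest at
    t = (a - mu) / sigma^2, which gives exp{-(a - mu)^2 / (2 sigma^2)}.
    """
    return math.exp(-((a - mu) ** 2) / (2.0 * sigma ** 2))


def chernoff_rho(mu0, sigma0, mu1, sigma1, steps=20001):
    """rho of (25.68), approximated by a grid search over mu0 <= a <= mu1."""
    best = 1.0
    for k in range(steps):
        a = mu0 + (mu1 - mu0) * k / (steps - 1)
        worst = max(min_mgf_normal(a, mu0, sigma0),
                    min_mgf_normal(a, mu1, sigma1))
        best = min(best, worst)
    return best


def chernoff_rho_closed_form(mu0, sigma0, mu1, sigma1):
    """Closed form for the normal case, as stated in Exercise 25.5."""
    return math.exp(-0.5 * ((mu1 - mu0) / (sigma1 + sigma0)) ** 2)


def chernoff_efficiency(rho1, rho2):
    """Right-hand side of (25.69): log(rho2) / log(rho1)."""
    return math.log(rho2) / math.log(rho1)


# Illustrative values only (not taken from the text).
rho_grid = chernoff_rho(0.0, 1.0, 1.0, 2.0)
rho_exact = chernoff_rho_closed_form(0.0, 1.0, 1.0, 2.0)
print(rho_grid, rho_exact)
```

The grid search is deliberately crude; it is adequate here because the maximum of the two curves has a single V-shaped minimum between the two means.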


25.18 Hoeffding (1965) develops a method of comparison of tests when $\alpha \to 0$ as
$n \to \infty$ (as distinct from the approach of 25.5 onwards, where $\alpha$ is held fixed and $H_1 \to H_0$
as $n \to \infty$) and shows using this method that LR tests have an optimum property for
tests on multinomial distributions.


EXERCISES 


25.1 The Sign Test for the hypothesis $H_0$ that a population median takes a specified
value $\theta_0$ consists of counting the number of sample observations exceeding $\theta_0$ and rejecting
$H_0$ when this number is too large. Show that for a normal population this test has
ARE $2/\pi$ compared to the “Student’s” t-test for $H_0$, and connect this with the result
of Example 25.2.

(Cochran, 1937)


25.2 Generalizing the result of Exercise 25.1, show that for any continuous frequency
function f with variance $\sigma^2$, the ARE of the Sign Test compared to the t-test is
$4\sigma^2\{f(\theta_0)\}^2$.

(Pitman, 1948)


25.3 The difference between the means of two normal populations with equal variances
is tested from two independent samples by comparing every observation $y_j$
in the second sample with every observation $x_i$ in the first sample, and counting the
number of times a $y_j$ exceeds an $x_i$. Show that the ARE of this S-test compared to the
two-sample “Student’s” t-test is $3/\pi$.

(Pitman, 1948)


25.4 Generalizing Exercise 25.3, show that if any two continuous frequency functions
$f(x)$, $f(x - \theta)$ differ only by a location parameter $\theta$, and have variance $\sigma^2$, the ARE of the
S-test compared to the t-test is

    $12\sigma^2 \left\{\int_{-\infty}^{\infty} \{f(x)\}^2\,dx\right\}^2.$

(Pitman, 1948)


25.5 If x is normally distributed with mean $\mu_i$ and variance $\sigma_i^2$, given $H_i$ ($i = 0, 1$;
$\mu_0 < \mu_1$), show that (25.68) has the value

    $\rho = \exp\{-\tfrac{1}{2}[(\mu_1 - \mu_0)/(\sigma_1 + \sigma_0)]^2\}.$

(Chernoff, 1952)


25.6 If $x/\sigma_i^2$ has a chi-squared distribution with r degrees of freedom, given
$H_i$ ($i = 0, 1$; $\sigma_0^2/\sigma_1^2 = \tau < 1$), show that $\rho$ in (25.68) satisfies

    $\log \rho = -\tfrac{1}{2} r (d - 1 - \log d),$

where

    $d = (\log \tau)/(\tau - 1).$
(Chernoff, 1952)


25.7 $t_1$ and $t_2$ are unbiassed estimators of $\theta$, jointly normally distributed in large
samples with variances $\sigma^2$, $\sigma^2/e$ respectively ($0 < e < 1$). Using the results of 16.23
and 17.29, show that

    $E(t_1 \mid t_2) = \theta(1 - e) + t_2 e,$

and hence that if $t_2$ is observed to differ from $\theta$ by a multiple d of its standard deviation,
we expect $t_1$ to differ from $\theta$ by a multiple $d e^{1/2}$ of its standard deviation.
(D. R. Cox, 1956)


25.8 Using Exercise 25.7, show that if $t_2$ is used to test $H_0: \theta = \theta_0$, we may calculate
the “expected result” of a test based on the more efficient statistic $t_1$. In particular,
show that if a one-tail test of size 0.01, using $t_2$, rejects $H_0$, we should expect a one-tail
test of size 0.05, using $t_1$, to do so if $e > 0.50$; while if an “equal-tails” size-0.01 test
on $t_2$ rejects $H_0$, we should expect an “equal-tails” size-0.05 test on $t_1$ to do so if $e > 0.58$.
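
The two numerical thresholds quoted in Exercise 25.8 can be checked with a few lines of arithmetic. This is a sketch, not a solution of the exercise: it assumes the large-sample normal approximation and uses the result of Exercise 25.7 that a deviation of d standard deviations in the less efficient statistic corresponds to an expected deviation of $d e^{1/2}$ standard deviations in the more efficient one. The function name and argument names are hypothetical.

```python
from statistics import NormalDist


def min_efficiency(size_less_efficient, size_more_efficient, two_sided=False):
    """Smallest e for which a bare rejection by the less efficient statistic
    leads us to expect rejection by the more efficient one.

    If the less efficient statistic just reaches its critical multiple d2,
    the expected multiple for the efficient statistic is d2 * sqrt(e), which
    must reach the critical multiple d1 of the second test.
    """
    z = NormalDist()
    tails = 2.0 if two_sided else 1.0
    d2 = z.inv_cdf(1.0 - size_less_efficient / tails)
    d1 = z.inv_cdf(1.0 - size_more_efficient / tails)
    return (d1 / d2) ** 2


print(round(min_efficiency(0.01, 0.05), 2))                  # one-tail case
print(round(min_efficiency(0.01, 0.05, two_sided=True), 2))  # equal-tails case
```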


25.9 Let $t_1$ be a statistic with maximum ARE and $t_2$ any other test statistic for the
same problem with $\delta$ and m as for $t_1$. By considering the weighted average

    $t_3 = a\,\dfrac{t_1}{D_{10}} + (1 - a)\,\dfrac{t_2}{D_{20}},$

show that, in (25.16),

    $R_3 = \{aR_1 + (1 - a)R_2\}/\{a^2 + (1 - a)^2 + 2a(1 - a)\rho\}^{1/2},$

where $\rho$ is the asymptotic correlation coefficient of $t_1$ and $t_2$. Hence show that $\rho = R_2/R_1$,
the $(m\delta)$th power of the ARE (25.27).
(Cf. van Eeden, 1963)


STATISTICAL RELATIONSHIP:
LINEAR REGRESSION AND CORRELATION 


26.1. For this and the next three chapters we shall be concerned with one or another 
aspect of the relationships between two or more variables. We have already, at various 
points in our exposition, discussed bivariate and multivariate distributions, their 
moments and cumulants ; in particular, we have discussed the properties of bivariate 
and multivariate normal distributions. However, a systematic discussion of the rela- 
tionships between variables was deferred until the theory of estimation and testing 
hypotheses had been explored. Even in this group of four chapters, we shall not be 
able to address ourselves to the whole problem, the more complicated distributional 
problems of three or more variables being deferred until we discuss Multivariate 
Analysis in Volume 3. 


26.2 Even so, the area which we are about to study is a very large one, and it will 
be helpful if we begin by reviewing it in a general way. 

Most of our work stems from an interest in the joint distribution of a pair of random 
variables: we may describe this as the problem of statistical relationship. There is
a quite distinct field of interest concerning relationships of a strictly functional kind 
between variables, such as those of classical physics ; this subject is of statistical interest 
because the functionally related variables are subject to observational or instrumental 
errors. We call this the problem of functional relationship, and discuss it in Chapter 29 
below. Before we reach that chapter, we shall be concerned with the problem of 
statistical relationship alone, where the variables are not (except in degenerate cases) 
functionally related, although they may also be subject to observational or instrumental 
errors; we regard them simply as members of a distributional complex. 


26.3 Within the field of statistical relationship there is a further useful distinc- 
tion to be made. We may be interested either in the interdependence between a number 
(not necessarily all) of our variables or in the dependence of one or more variables upon 
others. For example, we may be interested in whether there is a relationship between 
length of arm and length of leg in men ; put this way, it is a problem of interdependence. 
But if we are interested in using leg-length measurements to convey information about 
arm-length, we are considering the dependence of the latter upon the former. This 
is a case in which either interdependence or dependence may be of interest. On the 
other hand, there are situations when only dependence is of interest. The relationship
of crop-yields and rainfall is an example in which non-statistical considerations make
it clear that there is an essential asymmetry in the situation: we say, loosely, that
rainfall “causes” crop-yield to vary, and we are quite certain that crops do not affect
the rainfall, so we measure the dependence of yield upon rainfall.

There is no clear-cut distinction in statistical terminology for the techniques appropriate
to these essentially different types of problem. For example, we shall see in
Chapter 27 that if we are interested in the interdependence of two variables with the
effects of other variables eliminated, we use the method called “partial correlation,”
while if we are interested in the dependence of a single variable upon a group of others,
we use “multiple correlation.” Nevertheless, it is true in the main that the study
of interdependence leads to the theory of correlation dealt with in Chapters 26-27, while
the study of dependence leads to the theory of regression discussed in these chapters
and in Chapter 28.


26.4 Before proceeding to the exposition of the theory of correlation (largely 
developed around the beginning of this century by Karl Pearson and by Yule), which 
will occupy most of this chapter, we make one final general point. A statistical rela- 
tionship, however strong and however suggestive, can never establish a causal connexion : 
our ideas on causation must come from outside statistics, ultimately from some theory 
or other. Even in the simple example of crop-yield and rainfall discussed in 26.3, we
had no statistical reason for dismissing the idea of dependence of rainfall upon crop-yield:
the dismissal is based on quite different considerations. Even if rainfall and
crop-yields were in perfect functional correspondence, we should not dream of reversing
the “obvious” causal connexion. We need not enter into the philosophical implications
of this; for our purposes, we need only reiterate that statistical relationship, of
whatever kind, cannot logically imply causation.

G. B. Shaw made this point brilliantly in his Preface to The Doctor’s Dilemma
(1906): “Even trained statisticians often fail to appreciate the extent to which statistics
are vitiated by the unrecorded assumptions of their interpreters ... It is easy to
prove that the wearing of tall hats and the carrying of umbrellas enlarges the chest,
prolongs life, and confers comparative immunity from disease. ... A university
degree, a daily bath, the owning of thirty pairs of trousers, a knowledge of Wagner’s
music, a pew in church, anything, in short, that implies more means and better nurture
... can be statistically palmed off as a magic-spell conferring all sorts of privileges.
... The mathematician whose correlations would fill a Newton with
admiration, may, in collecting and accepting data and drawing conclusions from them,
fall into quite crude errors by just such popular oversights as I have been describing.”

Although Shaw was on this occasion supporting a characteristically doubtful cause,
his logic was valid. In the first flush of enthusiasm for correlation techniques, it was
easy for early followers of Karl Pearson and Yule to be incautious. It was not until
twenty years after Shaw wrote that Yule (1926) frightened statisticians by adducing
cases of very high correlations which were obviously not causal: e.g. the annual suicide
rate was highly correlated with the membership of the Church of England. Most of
these “nonsense” correlations operate through concomitant variation in time, and they
had the salutary effect of bringing home to the statistician that causation cannot be
deduced from any observed co-variation, however close. Now, more than thirty years
later, the reaction has perhaps gone too far: correlation analysis is very unfashionable
among statisticians. Yet there are large fields of application (the social sciences and
psychology, for example) where patterns of causation are not yet sufficiently well
understood for correlation analysis to be replaced by more specifically “structural”
statistical methods, and also large areas of multivariate analysis where the computation of
what is in effect a matrix of correlation coefficients is a necessary prelude to the detailed
statistical analysis; on both these accounts, some study of the subject is necessary.


26.5 In Chapter 1 (Tables 1.15, 1.23 and 1.24) we gave a few examples of bivariate 
distributions arising in practice. Tables 26.1 and 26.2 give two further examples which 
will be used for illustrative purposes. 

Table 26.1. Distribution of weight and stature for 4,995 women in Great Britain, 1951
Reproduced, by permission, from Women’s Measurements and Sizes, London, H.M.S.O., 1957

Weight (y): central values of groups, in pounds (rows)
Stature (x): central values of groups, in inches (columns)

Weight   54  56  58  60  62  64  66  68  70  72  74  TOTAL
278°5 1 1 
Zin) _ 
266°5 1 1 
260°5 1 1 
254°5 - 
248-5 1 1 Zz 
242°5 1 1 
236°5 1 1 
230°5 2 1 3; 
224°5 1 2 1 a 
218°5 1 2 1 1 5 
212°5 2 1 6 1 4 11 
206:°5 2 2 3 z 1 10 
200:5 4 2 6 2 14 
194-5 1 3 7 Z 4 1 Zs 
188-5 1 5 14 eee gS 4 46 
182°5 1 7 12 26 9 5 : ee: 63 
176°5 5 8 18 2 ee es 5 See 2 87 
170°5 5 iis © | 17 eee eee eee ee 112 
164-5 1 a= 32 35 a6 36 = 2155 == S39 132 
158-5 ee ©: 52 42365 2 185 
1525 1 , ee 81 = Se See SS Lis 
146°5 — 4 ae 76 ofS 50 SS 345 
140-5 1 6 = 552 10 4138. - 289. 2-50 3 448 
134-5 15 64 95... 175427 ASS SS 521 
128-5 SS ees Se Se SS 584 
1225 jee ees ees Sec eee 3 ees ee Se 591 
116°5 3 24 108 184 184 50 8 561 
110-5 5. ..33: .119-. 365 124 2 4 472 
104-5 : SS Se 95 35 6 260 
98-5 2S a 45 16 3 159 
92°5 e410 21 9 46 
86°5 1 5 3 9 
80°5 eee 1 4 


TOTAL   5  33  254  813  1340  1454  750  275  56  11  4  4995

Table 26.2. Distribution of bust girth and stature for 4,995 women in Great Britain, 1951
Data from same source as those of Table 26.1

Bust girth (y): central values of groups, in inches (rows)
Stature (x): central values of groups, in inches (columns)

Bust girth   54  56  58  60  62  64  66  68  70  72  74  TOTAL
56 1 1 
54 1 2 = 
52 1 3 4 1 1 10 
50 1 3 5 4 1 14 
48 1 3 9 Z 6 3 1 30 
46 #43 17 17 7 1 ¥ | 
44 $= 1422 25 50 47 1 4 162 
42 ) a Se 85 ie oe 261 
40 2 2 fe ee ee 479 
38 2-30. 9 Se ae SS 709 
36 G46. 188 317 410. 963-89: 15 = 1 1337 
34 + = 9 Gy 20 376 427 G6 59 = 8 1353 
32 ee 2 2 = a 504 
30 Se. ees Se 25 10 2 71 
28 Z 1 1 4+ 
TOTAL   5  33  254  813  1340  1454  750  275  56  11  4  4995


For the moment, we treat these data as populations, leaving aside the question of 
sampling until later in the chapter. 

Just as, for univariate distributions, we constructed summarizing constants such as 
the mean, variance, etc., we should like to summarize the relationship between the 
variables, and in particular their interdependence. Summarizing constants for a 
bivariate distribution arise naturally from the following considerations. 

We call the two variables x, y. For any given value of x, say X, the distribution
of y is called a y-array. The y-array is, of course, the conditional distribution of y
given that x = X. This conditional distribution has a mean which we write

    $\bar{y}_X = E(y \mid X),$    (26.1)

which will be a function of X, and vary with it. Similarly, by considering the x-array
for y = Y, we have

    $\bar{x}_Y = E(x \mid Y).$    (26.2)

(26.1) and (26.2) are called the regression curves (or, more shortly, the regressions) of
y on x and of x on y respectively.

Although we have done so here for explicitness, we shall not use a capital letter to
denote the variable being held constant where the context makes the notation $E(y \mid x)$,
$E(x \mid y)$ clear.
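
For grouped data the array means of (26.1) and (26.2) are simply weighted means taken down each column and along each row of the two-way frequency table. The sketch below shows the calculation on a small hypothetical table (it is not an extract from Tables 26.1 or 26.2); the function name and the table layout, rows indexed by y and columns by x, are assumptions made for the illustration.

```python
def array_means(table, x_values, y_values):
    """Means of the y-arrays and of the x-arrays of a two-way frequency table.

    table[i][j] is the frequency of the pair (y_values[i], x_values[j]).
    Returns two dicts: {x: mean of y given x} and {y: mean of x given y}.
    Arrays with zero total frequency are omitted.
    """
    y_given_x = {}
    for j, x in enumerate(x_values):
        total = sum(table[i][j] for i in range(len(y_values)))
        if total:
            weighted = sum(y_values[i] * table[i][j] for i in range(len(y_values)))
            y_given_x[x] = weighted / total

    x_given_y = {}
    for i, y in enumerate(y_values):
        total = sum(table[i])
        if total:
            weighted = sum(x_values[j] * table[i][j] for j in range(len(x_values)))
            x_given_y[y] = weighted / total

    return y_given_x, x_given_y


# Hypothetical grouped data, using central values of the groups.
x_values = [60, 62, 64]
y_values = [110, 120, 130]
table = [
    [4, 2, 0],  # y = 110
    [2, 6, 2],  # y = 120
    [0, 2, 4],  # y = 130
]

y_on_x, x_on_y = array_means(table, x_values, y_values)
print(y_on_x)  # means of y-arrays, one per value of x
print(x_on_y)  # means of x-arrays, one per value of y
```

Plotting the first set of means against x and the second against y gives the two empirical regression curves.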

Fig. 26.1 and 26.2 show, for the data of Tables 26.1 and 26.2, the means of y-arrays
(marked by crosses) and of x-arrays (marked by circles). Lines CC’ and RR’ have
been drawn to fit the array-means as closely as possible for straight lines, in the sense
of Least Squares (cf. 26.8). These diagrams summarize the properties of a bivariate
distribution in the same way that a mean summarizes a univariate distribution.
Since these are grouped data, the only arrays for which we have information are those
corresponding to the grouped values of x and y. Conventionally, the y-arrays are
taken to refer to the central value of the x-group within which they are observed, and
similarly for the x-arrays.

We shall study regression in its own right in Chapter 28—here we are using the 
ideas of linear regression mainly as an introduction to a measure of interdependence, 
the coefficient of (product-moment) correlation, though we shall also take the oppor- 
tunity to complete our study of the bivariate normal distribution from the standpoint 
of regression and correlation. 


Covariance and regression 

26.6 It is natural to consider using as the basis of a measure of dependence the
product-moment $\mu_{11}$, which we have encountered several times already in earlier chapters.
$\mu_{11}$, which is known as the covariance of x and y, is defined for a discrete population by

    $\mu_{11} = \sum_i (x_i - \mu_x)(y_i - \mu_y)/n = \sum_i x_i y_i/n - \mu_x \mu_y,$    (26.3)

where n is the number of pairs of values x, y, and $\mu_x$, $\mu_y$ are the means of x, y. For
a continuous population defined by

    $dF(x, y) = f(x, y)\,dx\,dy,$    (26.4)

the corresponding expression is (cf. 3.27)

    $\mu_{11} = \int\!\!\int (x - \mu_x)(y - \mu_y)\,dF(x, y) = E\{(x - \mu_x)(y - \mu_y)\} = E(xy) - E(x)E(y).$    (26.5)

If the variates x, y are independent,

    $\mu_{11} = \kappa_{11} = 0,$    (26.6)

as we saw in Example 12.7. By that Example, too, the converse is not generally true:
(26.6) does not generally imply independence, which requires

    $\kappa_{rs} = 0$ for all $r \times s \neq 0.$    (26.7)

For a bivariate normal distribution, however, we know that $\kappa_{rs} = 0$ for all $r + s > 2$,
so that $\kappa_{11}$ is the only non-zero product-cumulant. Thus (26.6) implies (26.7) and
independence for normal variables. It may also do so for other specified distributions,
but it does not in general do so. Example 26.1 gives a non-normal distribution for
which $\kappa_{11} = 0$ implies independence; Example 26.2 gives one where it does not.


Example 26.1 
If x and y are bivariate normally distributed standardized variables, the joint characteristic
function of $x^2$ and $y^2$ is

    $\phi(t, u) = \dfrac{1}{2\pi(1 - \rho^2)^{1/2}} \int\!\!\int \exp\left\{-\dfrac{1}{2(1 - \rho^2)}\left[x^2\{1 - 2(1 - \rho^2)it\} - 2\rho xy + y^2\{1 - 2(1 - \rho^2)iu\}\right]\right\} dx\,dy.$

The integral is, by Exercise 1.5, equal to

    $2\pi(1 - \rho^2) \begin{vmatrix} 1 - 2(1 - \rho^2)it & -\rho \\ -\rho & 1 - 2(1 - \rho^2)iu \end{vmatrix}^{-1/2},$

so that

    $\phi(t, u) = (1 - \rho^2)^{1/2}\left[\{1 - 2(1 - \rho^2)it\}\{1 - 2(1 - \rho^2)iu\} - \rho^2\right]^{-1/2} = \left[(1 - 2it)(1 - 2iu) + 4\rho^2 tu\right]^{-1/2}.$

We see that

    $\phi(t, 0) = (1 - 2it)^{-1/2}, \qquad \phi(0, u) = (1 - 2iu)^{-1/2},$

so that the marginal distributions are chi-squares with one degree of freedom, as we
know. By differentiating the logarithm of $\phi(t, u)$ we find

    $\mu_{11} = \kappa_{11} = \left[\dfrac{\partial^2 \log \phi(t, u)}{\partial(it)\,\partial(iu)}\right]_{t=u=0} = 2\rho^2.$

Now when $\rho = 0$, we see that

    $\phi(t, u) = \phi(t, 0)\,\phi(0, u),$

a necessary and sufficient condition for independence of $x^2$ and $y^2$ by 4.16-17. Thus
$\mu_{11} = 0$ implies independence in this case.


Example 26.2 
Consider a bivariate distribution with uniform probability over a unit circle centred
at the means of x and y. We have

    $dF(x, y) = dx\,dy/\pi, \qquad 0 \le x^2 + y^2 \le 1,$

whence

    $\mu_{11} = \int\!\!\int xy\,dF = \dfrac{1}{\pi}\int\!\!\int xy\,dx\,dy = \dfrac{1}{\pi}\int_{-1}^{1} x\left[\tfrac{1}{2}y^2\right]_{-(1 - x^2)^{1/2}}^{+(1 - x^2)^{1/2}} dx = 0,$

as is otherwise obvious. But clearly x and y are not independent, since the range of
variation of each depends on the value of the other.

Linear regression 
26.7 If the regression of x on y, defined at (26.2), is exactly linear, we have the
equation

    $E(x \mid y) = \alpha_1 + \beta_1 y,$    (26.8)

in which we now determine $\alpha_1$ and $\beta_1$. Taking expectations on both sides of (26.8)
with respect to y, we find

    $\mu_x = \alpha_1 + \beta_1 \mu_y.$    (26.9)

If we subtract (26.9) from (26.8), multiply both sides by $(y - \mu_y)$ and take expectations
again, we find

    $E\{(x - \mu_x)(y - \mu_y)\} = \beta_1 E\{(y - \mu_y)^2\},$

or

    $\beta_1 = \mu_{11}/\sigma_2^2,$    (26.10)

where $\sigma_2^2$ is the variance of y. Similarly, we obtain from (26.1)

    $\beta_2 = \mu_{11}/\sigma_1^2$    (26.11)


for the coefficient in an exactly linear regression of y on x. (26.10) and (26.11) define
the (linear) regression coefficients(*) of x on y ($\beta_1$) and of y on x ($\beta_2$). Using (26.8),
(26.9) and (26.10), we have

    $E(x \mid y) - \mu_x = \beta_1 (y - \mu_y)$    (26.12)

and similarly

    $E(y \mid x) - \mu_y = \beta_2 (x - \mu_x).$    (26.13)

(26.12) and (26.13) are the linear regression equations.
We have already encountered a case of exact linear regressions in our discussion
of the bivariate normal distribution in 16.23.


Example 26.3 
The regressions of $x^2$ and $y^2$ on each other in Example 26.1 are strictly linear. For,
from 16.23, putting $\sigma_1 = \sigma_2 = 1$ in (16.46), we have

    $E(x \mid y) = \rho y,$
    $\operatorname{var}(x \mid y) = 1 - \rho^2.$

Thus

    $E(x^2 \mid y) = \operatorname{var}(x \mid y) + \{E(x \mid y)\}^2 = 1 - \rho^2 + \rho^2 y^2.$

To each value of $y^2$ there correspond values $+y$ and $-y$ which occur with equal probability.
Thus, since $E(x^2 \mid y)$ is a function of $y^2$ only,

    $E(x^2 \mid y^2) = \tfrac{1}{2}\{E(x^2 \mid y) + E(x^2 \mid -y)\} = E(x^2 \mid y) = 1 - \rho^2 + \rho^2 y^2,$

which we may rewrite, in the form (26.12),

    $E(x^2 \mid y^2) - 1 = \rho^2 (y^2 - 1),$

and the regression of $x^2$ on $y^2$ is strictly linear. Similarly

    $E(y^2 \mid x^2) - 1 = \rho^2 (x^2 - 1).$

Since we saw in Example 26.1 that $\mu_{11} = 2\rho^2$, and we know that the variances = 2,
since these are chi-squared distributions with one degree of freedom, we may confirm
from (26.10) and (26.11) that $\rho^2$ is the regression coefficient in each of the linear regression
equations.


(*) The notation $\beta_1$, $\beta_2$ is unconnected with the symbolism for skewness and kurtosis in
3.31-2; they are unlikely to be confused, since they arise in different contexts.

Example 26.4


In Example 26.2 it is easily seen that E(x|y) = E(y|x) = 0, so that we have 
linear regressions here, too, the coefficients being zero. 


Example 26.5 


Consider the variables x and $y^2$ in Example 26.3. We saw there that the regression
of $y^2$ is linear on $x^2$, with coefficient $\rho^2$, and it is therefore not linear on x when $\rho \neq 0$.
However, since $E(x \mid y) = \rho y$ we have

    $E(x \mid y^2) = \tfrac{1}{2}\{E(x \mid y) + E(x \mid -y)\} = 0,$

so that the regression of x on $y^2$ is linear with regression coefficient zero.


Approximate linear regression: Least Squares 


26.8 Examples 26.3-5 give instances where one or both regressions are exactly 
linear. When the population is an observed and not a theoretical one, however (and 
a fortiori when we have to take sampling fluctuation into account), it is very rare to 
find an exactly linear regression. Nevertheless, as in Fig. 26.1, the regression may be 
near enough to the linear form for us to wish to use a linear regression as an approximation.
We are therefore led to the problem of “fitting” a straight line to the regression
curve of y on x.

When there are no sampling considerations involved, the choice of a method of 
fitting is essentially arbitrary, in exactly the same way that, from the point of view of 
the description of data, the choice between mean and median as a measure of location 
is arbitrary. If we are fitting the regression of y on x, it is clearly desirable that in 
some sense the deviations of the points (y, x) from the fitted line should be small if 
the line is to represent them adequately. We might consider choosing the line to mini- 
mize the sum of the absolute deviations of the points from the line, but this gives rise 
to the usual mathematical difficulties accompanying an expression involving a modulus 
sign. Just as these difficulties lead us to prefer the standard deviation to the mean 
deviation as a measure of dispersion, they lead us here to propose that the sum of squares 
of the deviations of the points should be minimized. 

We have still to determine how the deviations are to be taken: in the y-direction,
the x-direction, or as “normal” deviations obtained by dropping a perpendicular
from each point to the line. As we are considering the dependence of y on x, it seems
natural to minimize the sum of squared deviations in the y-direction. Thus we are
led back to the Method of Least Squares: we choose the “best-fitting” regression
line of y on x,

    $y = \alpha_2 + \beta_2 x,$    (26.14)

so that the sum of squared deviations of the n observations from the fitted regression
line, i.e.

    $S = \sum_{i=1}^{n} \{y_i - (\alpha_2 + \beta_2 x_i)\}^2,$    (26.15)


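is minimized. The problem is to determine $\alpha_2$ and $\beta_2$. We have already considered
a much more general form of this problem in 19.4. In the matrix notation we used
there, (26.14) is

    $\mathbf{y} = (\mathbf{1} : \mathbf{x}) \begin{pmatrix} \alpha_2 \\ \beta_2 \end{pmatrix},$

with $\mathbf{y}$ of order $(n \times 1)$, $(\mathbf{1} : \mathbf{x})$ of order $(n \times 2)$ and the parameter vector of order $(2 \times 1)$,
where $\mathbf{1}$ is a $(n \times 1)$ vector of units. The solution is, from (19.12),

    $\begin{pmatrix} \alpha_2 \\ \beta_2 \end{pmatrix} = \{(\mathbf{1} : \mathbf{x})'(\mathbf{1} : \mathbf{x})\}^{-1} (\mathbf{1} : \mathbf{x})'\mathbf{y}$

    $= \begin{pmatrix} n & \sum x \\ \sum x & \sum x^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum y \\ \sum xy \end{pmatrix}$

    $= \dfrac{1}{n\sum x^2 - (\sum x)^2} \begin{pmatrix} \sum x^2 & -\sum x \\ -\sum x & n \end{pmatrix} \begin{pmatrix} \sum y \\ \sum xy \end{pmatrix}$

    $= \dfrac{1}{n\sum x^2 - (\sum x)^2} \begin{pmatrix} \sum x^2 \sum y - \sum x \sum xy \\ n\sum xy - \sum x \sum y \end{pmatrix}.$

Thus

    $\beta_2 = \dfrac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \dfrac{\mu_{11}}{\sigma_1^2},$

just as at (26.11) for the case of exact linearity of regression; while

    $\alpha_2 = \dfrac{\sum x^2 \sum y - \sum x \sum xy}{n\sum x^2 - (\sum x)^2} = \mu_y - \beta_2 \mu_x,$

the equivalent of (26.9). Thus (26.14) becomes

    $y - \mu_y = \beta_2 (x - \mu_x),$

the analogue of (26.13).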

We have thus reached the conclusion that the calculation of an approximate regres- 
sion line by the Method of Least Squares gives results which are the same as the correct 
ones in the case of exact linearity of regression. 
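
The conclusion of 26.8 is easy to confirm numerically. The following sketch solves the two normal equations in closed form and compares the result with the moment expressions $\mu_{11}/\sigma_1^2$ and $\mu_y - \beta_2\mu_x$. The data are hypothetical and the function names are assumptions; moments are taken with divisor n, since the data are treated as a population.

```python
def least_squares_line(xs, ys):
    """Intercept and slope of the least-squares line of y on x,
    from the closed-form solution of the normal equations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    det = n * sum_xx - sum_x ** 2
    slope = (n * sum_xy - sum_x * sum_y) / det
    intercept = (sum_xx * sum_y - sum_x * sum_xy) / det
    return intercept, slope


def population_moments(xs, ys):
    """Means, variances and covariance, all with divisor n."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    mu11 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return mean_x, mean_y, var_x, var_y, mu11


# Hypothetical observations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

alpha2, beta2 = least_squares_line(xs, ys)
mean_x, mean_y, var_x, var_y, mu11 = population_moments(xs, ys)

print(beta2, mu11 / var_x)               # the two slopes agree
print(alpha2, mean_y - beta2 * mean_x)   # the two intercepts agree
```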


The correlation coefficient 

26.9 In view of the result of 26.8, we now make our discussion cover the general
case where regression is not exactly linear. The linear regression coefficients, (26.10)
and (26.11), are generally the coefficients in approximate regression lines, though on
occasions these lines may be exact.

We now define the product-moment correlation coefficient $\rho$ by

    $\rho = \mu_{11}/(\sigma_1 \sigma_2),$    (26.16)

whence, from (26.10), (26.11) and (26.16),

    $\rho^2 = \beta_1 \beta_2.$    (26.17)


$\rho$ is a symmetric function of x and y, as any coefficient of interdependence should be.
Since it is a homogeneous function of moments about the means, it is invariant under
changes of origin and scale. $\rho$ has the same sign as $\beta_1$ and $\beta_2$, since all three have $\mu_{11}$
as numerator and a positive denominator; when $\mu_{11} = 0$, $\rho = 0$. From (26.17) we
see that $|\rho|$ is the geometric mean of $|\beta_1|$ and $|\beta_2|$.

By the Cauchy-Schwarz inequality

    $\mu_{11}^2 = \left\{\int\!\!\int (x - \mu_x)(y - \mu_y)\,dF\right\}^2 \le \left\{\int\!\!\int (x - \mu_x)^2\,dF\right\}\left\{\int\!\!\int (y - \mu_y)^2\,dF\right\} = \sigma_1^2 \sigma_2^2,$

so that

    $0 \le \rho^2 \le 1,$    (26.18)

the upper equality in (26.18) holding (cf. 2.7) if and only if $(x - \mu_x)$ and $(y - \mu_y)$ are
strictly proportional, i.e. x and y are in strict linear functional relationship. Essentially,
therefore, $\rho$ is the covariance $\mu_{11}$ divided by a factor which ensures that $\rho$ will
lie in the interval $(-1, +1)$.
It may easily be shown that the angle between the two regression lines (26.12) and
(26.13) is

    $\theta = \arctan\left\{\dfrac{\sigma_1 \sigma_2}{\sigma_1^2 + \sigma_2^2}\left(\dfrac{1}{\rho} - \rho\right)\right\},$    (26.19)

so that as $\rho$ varies over its range from $-1$ to $+1$, $\theta$ increases steadily from 0 to its
maximum of $\tfrac{1}{2}\pi$ when $\rho = 0$, and then decreases steadily to 0 again. Thus, if and only
if x and y are in strict linear functional relationship, the two regression lines coincide
($\rho^2 = 1$). If and only if $\rho = 0$, when x and y are said to be uncorrelated, the regression
lines are at right angles to each other.

It may be shown that

    $\rho^2 = \operatorname{var}(\alpha_2 + \beta_2 x)/\sigma_2^2 = \operatorname{var}(\alpha_1 + \beta_1 y)/\sigma_1^2,$    (26.20)

where “var” here means simply the calculated variance. The proof of (26.20) is left
to the reader as Exercise 26.13.
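
The relations (26.17), (26.19) and (26.20) can be checked together on a small data set. This is a numerical sketch, not a proof: the data are hypothetical and positively correlated, so that the angle given by (26.19) is positive, and all moments use divisor n. The angle is computed twice, once from (26.19) and once directly from the slopes of the two lines in the (x, y) plane, where the line of x on y has slope $1/\beta_1$.

```python
import math


def population_moments(xs, ys):
    """Means, variances and covariance, all with divisor n."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    mu11 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return mean_x, mean_y, var_x, var_y, mu11


# Hypothetical data with positive correlation.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

mean_x, mean_y, var_x, var_y, mu11 = population_moments(xs, ys)
beta1 = mu11 / var_y                      # x on y, (26.10)
beta2 = mu11 / var_x                      # y on x, (26.11)
alpha2 = mean_y - beta2 * mean_x
rho = mu11 / math.sqrt(var_x * var_y)     # (26.16)

# Angle between the regression lines: formula (26.19) against the slopes.
sigma1, sigma2 = math.sqrt(var_x), math.sqrt(var_y)
theta_formula = math.atan(sigma1 * sigma2 / (var_x + var_y) * (1.0 / rho - rho))
theta_slopes = math.atan(1.0 / beta1) - math.atan(beta2)

# Variance of the fitted line of y on x, for (26.20).
fitted = [alpha2 + beta2 * x for x in xs]
mean_fitted = sum(fitted) / len(fitted)
var_fitted = sum((f - mean_fitted) ** 2 for f in fitted) / len(fitted)

print(rho ** 2, beta1 * beta2, var_fitted / var_y)
print(theta_formula, theta_slopes)
```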


$\rho$ as a coefficient of interdependence

26.10 From 26.6 and Example 26.2 we see that while independence of x and y
implies $\mu_{11} = \rho = 0$, the converse does not generally apply. It does apply for jointly
normal variables, and sometimes for others (Example 26.1). In this lies the difficulty
of interpreting $\rho$ as a coefficient of interdependence in general. In fact, we have seen
that $\rho$ is essentially a coefficient of linear interdependence, and more complex forms
of interdependence lie outside its vocabulary. In general, the problem of joint variation
is too complex to be comprehended in a single coefficient.

To express a quality, moreover, is not the same as to measure it. If $\rho = 0$ implies
independence, we know from 26.9 that as $|\rho|$ increases, the interdependence also
increases until when $|\rho| = 1$ we have the limiting case of linear functional relationship.
Even so, it remains an open question which function of $\rho$ should be used as a
measure of interdependence: we see from (26.20) that $\rho^2$ is more directly interpretable
than $\rho$ itself, being the ratio of the variance of the fitted line to the overall variance.
Leaving this point aside, $\rho$ gives us a measure in such cases, though there may be better
measures. On the other hand, if $\rho = 0$ does not imply independence, it is difficult to
interpret $\rho$ as a measure of interdependence, and perhaps wiser to use it as an indicator
rather than as a precise measure. In practical work, we would recommend the use
of $\rho$ as a measure of interdependence only in cases of normal or near-normal variation.

Computation of coefficients


26.11 From the definitions at (26.10), (26.11) and (26.16) we see that the linear
regression coefficients and the correlation coefficient require for their computation the
two variances and the covariance $\mu_{11}$. The calculation of variances was discussed in
2.19 and Example 2.7. The covariance is calculated similarly, using the identity stated
in (26.3),

$$\mu_{11} = \sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)/n = \sum_{i=1}^{n} x_i y_i/n - \mu_x\mu_y
= \sum_{i=1}^{n} x_i y_i/n - \Bigl(\sum_{i=1}^{n} x_i\Bigr)\Bigl(\sum_{i=1}^{n} y_i\Bigr)\Big/n^2.$$

For convenience, we often take arbitrary origins $a$, $b$, for $x$ and $y$ respectively. Then

$$\mu_{11} = \Sigma(x-a)(y-b)/n - \{\Sigma(x-a)\}\{\Sigma(y-b)\}/n^2 \tag{26.21}$$

identically in $a$ and $b$. In other words, $\mu_{11}$ is invariant under changes of origin, as it
must be since it is a product-moment about the means. (26.21) holds if we put
$(x-a) = (y-b)$, when it reduces to (2.21) for the calculation of variances. We usually
also find it convenient to take an arbitrary unit $u_x$ for $x$ and another arbitrary unit $u_y$
for $y$. It is easy to see that the effect of this is to divide $\mu_{11}$ by $u_x u_y$, $\sigma_1^2$ by $u_x^2$, and
$\sigma_2^2$ by $u_y^2$. Thus $\beta_1$ is multiplied by a factor $u_y/u_x$, $\beta_2$ by a factor $u_x/u_y$, and $\rho$ is quite
unaffected by a change of scales.

To summarize, since $\mu_{11}$, $\sigma_1^2$ and $\sigma_2^2$ are invariant under changes of origin, so $\beta_1$, $\beta_2$ and $\rho$
are. If different arbitrary scale factors are introduced, for computational purposes,
$\beta_1$ and $\beta_2$ require adjustment by the appropriate ratio; if the same scale factor is used
for each variable, $\beta_1$ and $\beta_2$ are unaffected. $\rho$ is unaffected by any scale change.
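The invariance properties summarized above can be checked numerically. The following is a minimal sketch, not part of the original text: the data values, the function names and the particular origins and units are all hypothetical, chosen only for illustration. It computes the coefficients from raw sums, as in (26.21), once in natural units and once in working units.

```python
def moments(xs, ys):
    """Variances and covariance (divisor n) from raw sums, as in (26.21)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    var_x = sum(x * x for x in xs) / n - (sum_x / n) ** 2
    var_y = sum(y * y for y in ys) / n - (sum_y / n) ** 2
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - sum_x * sum_y / n ** 2
    return var_x, var_y, cov_xy


def coefficients(xs, ys):
    """Return (beta1, beta2, rho): slope of x on y, slope of y on x, correlation."""
    var_x, var_y, cov_xy = moments(xs, ys)
    return cov_xy / var_y, cov_xy / var_x, cov_xy / (var_x * var_y) ** 0.5


def to_working_units(values, origin, unit):
    """Shift to an arbitrary origin and divide by an arbitrary unit."""
    return [(v - origin) / unit for v in values]


# Hypothetical illustrative data (not taken from the text).
xs = [62.0, 64.0, 66.0, 68.0, 70.0, 64.0]
ys = [110.5, 128.5, 122.5, 140.5, 134.5, 116.5]
a, b, u_x, u_y = 64.0, 134.5, 2.0, 6.0  # assumed working origins and units

beta1, beta2, rho = coefficients(xs, ys)
w_beta1, w_beta2, w_rho = coefficients(
    to_working_units(xs, a, u_x), to_working_units(ys, b, u_y)
)
```

In working units the slope of x on y comes out as `beta1 * u_y / u_x` and the slope of y on x as `beta2 * u_x / u_y`, while the correlation is unchanged.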


Example 26.6. Computation of coefficients for grouped data
For grouped data, such as those in Table 26.1, we choose the group-width of each
variable as the working unit for that variable (if the groups of a variable are of unequal
width, we usually take the smallest group width). We also choose a working origin
for each variable somewhere near the mean, estimated by eye. Thus we take the
x-origin in Table 26.1 at 64, the centre of the modal frequency-group, the marginal
distribution of x being near symmetry; the y-origin is placed at 134.5, since the mean
is likely to lie appreciably above the modal frequency group (122.5) for a very skew
distribution like the marginal distribution of y. The group widths (2 and 6) are taken
as working units. The sum of products, $\Sigma xy$, is calculated by multiplying each
frequency in turn by its "co-ordinates" in the table in the arbitrary units. Thus
the extreme "south-eastern" entry in the table, the frequency 4, for which x = 68,
y = 110.5, is multiplied by (+2)(−4) = −8, contributing −32 to the sum. We
find

$$\Sigma x = -2{,}353, \quad \Sigma y = -1{,}400, \quad \Sigma x^2 = 10{,}161, \quad \Sigma y^2 = 70{,}802, \quad \Sigma xy = +8{,}786,$$

giving

$$\begin{aligned}
\mu_x &= \frac{-2{,}353}{4{,}995} \times 2 + 64 = 63.06,\\
\mu_y &= \frac{-1{,}400}{4{,}995} \times 6 + 134.5 = 132.82,\\
\sigma_1^2 &= \Bigl\{\frac{10{,}161}{4{,}995} - \Bigl(\frac{2{,}353}{4{,}995}\Bigr)^2\Bigr\} \times 2^2 = 7.25,\\
\sigma_2^2 &= \Bigl\{\frac{70{,}802}{4{,}995} - \Bigl(\frac{1{,}400}{4{,}995}\Bigr)^2\Bigr\} \times 6^2 = 507.46,\\
\mu_{11} &= \Bigl\{\frac{8{,}786}{4{,}995} - \Bigl(\frac{-2{,}353}{4{,}995}\Bigr)\Bigl(\frac{-1{,}400}{4{,}995}\Bigr)\Bigr\} \times 2 \times 6 = +19.52,
\end{aligned}$$

whence

$$\rho = \frac{\mu_{11}}{\sigma_1\sigma_2} = 0.322, \qquad \beta_1 = \mu_{11}/\sigma_2^2 = 0.0385, \qquad \beta_2 = \mu_{11}/\sigma_1^2 = 2.692.$$

The (approximate) linear regression equations are:

x on y: x − 63.06 = 0.0385 (y − 132.82) or x = 0.0385y + 57.95,
y on x: y − 132.82 = 2.692 (x − 63.06) or y = 2.692x − 36.96.

These lines are drawn in on Fig. 26.1 (page 282) as RR' and CC' respectively.


Example 26.7. Computation of coefficients for ungrouped data
Table 26.3 shows the yields of wheat and of potatoes in 48 counties of England
in 1936. For ungrouped data such as these, it is rarely worth taking an arbitrary
origin and unit for calculations. Using the natural origins and units, we find

$$\begin{array}{lll}
\Sigma x = 758.0, & \mu_x = 15.792, & \beta_1 = \mu_{11}/\sigma_2^2 = 0.612,\\
\Sigma y = 291.1, & \mu_y = 6.065, & \beta_2 = \mu_{11}/\sigma_1^2 = 0.078,\\
\Sigma x^2 = 12{,}170.48, & \sigma_1^2 = 4.174, & \rho = \mu_{11}/(\sigma_1\sigma_2) = 0.219.\\
\Sigma y^2 = 1{,}791.03, & \sigma_2^2 = 0.534, & \\
\Sigma xy = 4{,}612.64, & \mu_{11} = 0.327, &
\end{array}$$

The (approximate) linear regression equations are therefore:
Regression of x on y: x − 15.792 = 0.612 (y − 6.065)
Regression of y on x: y − 6.065 = 0.078 (x − 15.792)
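The arithmetic of Example 26.7 can be retraced from the five sums alone. The sketch below is illustrative only: the function name and dictionary keys are hypothetical, and the value 291.1 for the sum of the y's is the one implied by the quoted mean 6.065 with n = 48, which is an assumption where the printed figure is hard to read.

```python
def coefficients_from_sums(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    """Means, variances, covariance, regression slopes and correlation from raw sums."""
    mean_x, mean_y = sum_x / n, sum_y / n
    var_x = sum_xx / n - mean_x ** 2
    var_y = sum_yy / n - mean_y ** 2
    cov_xy = sum_xy / n - mean_x * mean_y
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": var_x,
        "var_y": var_y,
        "cov_xy": cov_xy,
        "beta1": cov_xy / var_y,  # regression of x on y
        "beta2": cov_xy / var_x,  # regression of y on x
        "rho": cov_xy / (var_x * var_y) ** 0.5,
    }


# Sums quoted in Example 26.7 (sum_y = 291.1 is assumed, see the note above).
result = coefficients_from_sums(48, 758.0, 291.1, 12170.48, 1791.03, 4612.64)
```

The returned values agree with the coefficients quoted in the example to the three decimal places given there.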


The data and the regression lines are shown diagrammatically in Fig. 26.3, one point 
corresponding to each pair of values (x,y). A diagram such as this, on which all points 
Table 26.3. Yields of wheat and potatoes in 48 counties in England in 1936

                           Wheat     Potatoes |                            Wheat     Potatoes
                           (cwt      (tons    |                            (cwt      (tons
County                     per acre) per acre)| County                     per acre) per acre)
                              x         y     |                               x         y

Bedfordshire                16.0       …      | Northamptonshire            14.3       4.9
Huntingdonshire             16.0       6.6    | Peterborough                14.4       5.6
Cambridgeshire              16.4       6.1    | Buckinghamshire             19.2       6.4
…                           20.5       5.5    | Oxfordshire                 14.1       6.9
Suffolk, West               18.2       6.9    | Warwickshire                15.4       5.6
Suffolk, East               16.3       6.1    | Shropshire                  16.5       6.1
Essex                       17.7       6.4    | Worcestershire              14.2       5.7
Hertfordshire               15.3       6.3    | Gloucestershire             13.2       5.0
Middlesex                   16.5       7.8    | Wiltshire                   13.8       6.5
Norfolk                     16.9       8.3    | Herefordshire               14.4       6.2
Lincs (Holland)             21.8       …      | Somersetshire               13.4       5.2
Lincs (Kesteven)             …         6.2    | Dorsetshire                  …         6.6
Lincs (Lindsey)             15.8       6.0    | …                           14.4       5.8
Yorkshire (East Riding)     16.1       6.1    | Cornwall                    15.4       6.3
…                           18.5       6.6    | Northumberland              18.5       6.3
Surrey                      12.7       4.8    | Durham                      16.4       5.8
Sussex, East                15.7       4.9    | Yorkshire (North Riding)    17.0       5.9
Sussex, West                14.3       …      | Yorkshire (West Riding)      …         …
…                            …         5.5    | Cumberland                   …         …
Hampshire                    …         6.7    | Westmorland                  …         …
Isle of Wight               12.0       6.5    | Lancashire                  19.2       7.2
Nottinghamshire             15.6       …      | …                            …         6.5
Leicestershire              15.8       …      | Derbyshire                  15.2       5.4
…                            …         …      | Staffordshire               17.1       6.3

[Fig. 26.3. Data of Table 26.3, with regression lines. Horizontal axis: wheat yield (cwt per acre), x; vertical axis: potato yield (tons per acre), y.]
are plotted, is called a scatter diagram: its use is strongly recommended, since it
conveys quickly and simply an idea of the adequacy of the fitted regression lines (not
very good in our example). Indeed, a scatter diagram, plotted in advance of the 
analysis, will often make it clear whether the fitting of regression lines is worth while. 


Sample coefficients: standard errors 


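26.12 We now turn to the consideration of sampling problems for correlation and
regression coefficients. As usual, we observe the convention that a Roman letter
(actually italic) represents a sample statistic, the Greek letter being the population
equivalent. Thus we write

$$\begin{aligned}
b_1 &= m_{11}/s_2^2 = \frac{1}{n}\Sigma(x-\bar{x})(y-\bar{y}) \Big/ \frac{1}{n}\Sigma(y-\bar{y})^2,\\
b_2 &= m_{11}/s_1^2 = \frac{1}{n}\Sigma(x-\bar{x})(y-\bar{y}) \Big/ \frac{1}{n}\Sigma(x-\bar{x})^2,\\
r &= m_{11}/(s_1 s_2) = \frac{1}{n}\Sigma(x-\bar{x})(y-\bar{y}) \Big/ \Bigl\{\frac{1}{n}\Sigma(x-\bar{x})^2 \, \frac{1}{n}\Sigma(y-\bar{y})^2\Bigr\}^{1/2},
\end{aligned} \tag{26.22}$$

for the sample regression coefficients and correlation coefficient, the summations now
being over sample values. Just as for the population coefficients $\beta_1$, $\beta_2$, $\rho$, we may
simplify these expressions for computational purposes to

$$\begin{aligned}
b_1 &= \frac{\Sigma xy - (\Sigma x)(\Sigma y)/n}{\Sigma y^2 - (\Sigma y)^2/n},\\
b_2 &= \frac{\Sigma xy - (\Sigma x)(\Sigma y)/n}{\Sigma x^2 - (\Sigma x)^2/n},\\
r &= \frac{\Sigma xy - (\Sigma x)(\Sigma y)/n}{[\{\Sigma x^2 - (\Sigma x)^2/n\}\{\Sigma y^2 - (\Sigma y)^2/n\}]^{1/2}}.
\end{aligned} \tag{26.23}$$

Just as before, we have $-1 \le r \le +1$.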


26.13 The standard errors of the coefficients (26.22) are easily obtained. In fact,
we have already obtained the large-sample variance of r in Example 10.6, where we
saw that it is, in general, an expression involving all the second-order and fourth-order
moments of the population sampled. In the normal case, however, we found that it
simplified to

$$\operatorname{var} r = (1-\rho^2)^2/n, \tag{26.24}$$

though (26.24) is of little value in practice since the distribution of r tends to normality
so slowly (cf. 16.29): it is unwise to use it for n < 500. The difficulty is of
no practical importance, since, as we saw in 16.33, the simple transformation of r,

$$z = \tfrac{1}{2}\log\Bigl(\frac{1+r}{1-r}\Bigr), \tag{26.25}$$

is for normal samples much more closely normally distributed with approximate mean

$$E(z) = \tfrac{1}{2}\log\Bigl(\frac{1+\rho}{1-\rho}\Bigr) \tag{26.26}$$

and variance approximately

$$\operatorname{var} z = \frac{1}{n-3}, \tag{26.27}$$

independent of $\rho$. For n > 50, the use of this standard error for z is adequate; closer
approximations are given in 16.33.
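As an illustration of how (26.25) to (26.27) are used together, the sketch below forms an approximate central interval for $\rho$ by working on the z scale and transforming back. It is a hedged example rather than a procedure given in the text: the function names are hypothetical, the normal quantile 1.959964 (for a 95 per cent interval) is supplied as an assumed constant, and the inputs r = 0.322, n = 4995 are borrowed from Example 26.6 purely for illustration.

```python
import math


def fisher_z(r):
    """The transformation (26.25): z = (1/2) log{(1 + r)/(1 - r)}."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def z_confidence_interval(r, n, normal_quantile=1.959964):
    """Approximate interval for rho: z +/- quantile/sqrt(n - 3), mapped back by tanh.

    tanh is the inverse of (26.25), so each limit on the z scale is converted
    to the corresponding value of rho through (26.26).
    """
    z = fisher_z(r)
    half_width = normal_quantile / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


# Illustrative inputs only.
lower, upper = z_confidence_interval(0.322, 4995)
```

Because the variance (26.27) does not involve $\rho$, the interval needs no preliminary estimate of the standard error beyond the sample size.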
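For the sample regression coefficient of y on x,

$$b_2 = m_{11}/s_1^2,$$

the use of (10.17) gives, just as in Example 10.6 for r,

$$\operatorname{var} b_2 = \Bigl(\frac{\mu_{11}}{\sigma_1^2}\Bigr)^2 \Bigl\{\frac{\operatorname{var} m_{11}}{\mu_{11}^2} + \frac{\operatorname{var}(s_1^2)}{\sigma_1^4} - \frac{2\operatorname{cov}(m_{11}, s_1^2)}{\mu_{11}\sigma_1^2}\Bigr\}.$$

Substituting for the variances and covariance from (10.23) and (10.24), this becomes

$$\operatorname{var} b_2 = \frac{1}{n}\Bigl(\frac{\mu_{11}}{\sigma_1^2}\Bigr)^2 \Bigl\{\frac{\mu_{22}}{\mu_{11}^2} + \frac{\mu_{40}}{\sigma_1^4} - \frac{2\mu_{31}}{\mu_{11}\sigma_1^2}\Bigr\}. \tag{26.28}$$

For a normal parent population, we substitute the relations of Example 3.17 and obtain

$$\begin{aligned}
\operatorname{var} b_2 &= \frac{1}{n}\Bigl(\frac{\mu_{11}}{\sigma_1^2}\Bigr)^2 \Bigl\{\frac{\sigma_1^2\sigma_2^2 + 2\mu_{11}^2}{\mu_{11}^2} + \frac{3\sigma_1^4}{\sigma_1^4} - \frac{6\sigma_1^2\mu_{11}}{\mu_{11}\sigma_1^2}\Bigr\}\\
&= \frac{1}{n}\Bigl(\frac{\mu_{11}}{\sigma_1^2}\Bigr)^2 \Bigl(\frac{1}{\rho^2} - 1\Bigr)\\
&= \frac{1}{n}\frac{\sigma_2^2}{\sigma_1^2}(1-\rho^2).
\end{aligned} \tag{26.29}$$

Similarly, for the regression coefficient of x on y,

$$\operatorname{var} b_1 = \frac{1}{n}\frac{\sigma_1^2}{\sigma_2^2}(1-\rho^2). \tag{26.30}$$

The expressions (26.29) and (26.30) are rather more useful for standard error purposes
(when, of course, we substitute $s_1^2$, $s_2^2$ and $r$ for $\sigma_1^2$, $\sigma_2^2$ and $\rho$ in them) than (26.24), for
we saw at 16.35 that the exact distribution of $b_2$ is symmetrical about $\beta_2$: it is left to
the reader as Exercise 26.9 to show from (16.86) that (26.29) is exact when multiplied
by a factor n/(n−3), and that the distribution of $b_2$ tends to normality rapidly, its
measure of kurtosis being of order 1/n.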
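The standard error implied by (26.29), with and without the factor n/(n−3), can be sketched as follows. The function name and the `exact` flag are hypothetical, and the inputs are the rounded sample values of Example 26.7, used here only as an illustration.

```python
import math


def slope_standard_error(s1_sq, s2_sq, r, n, exact=False):
    """Standard error of b2 (slope of y on x) from (26.29), with sample values
    substituted for the population quantities.

    With exact=True the variance is multiplied by n/(n - 3), the correction
    mentioned in the text for normal samples.
    """
    variance = (s2_sq / s1_sq) * (1.0 - r * r) / n
    if exact:
        variance *= n / (n - 3)
    return math.sqrt(variance)


# Illustrative inputs: s1^2 = 4.174, s2^2 = 0.534, r = 0.219, n = 48.
se_large_sample = slope_standard_error(4.174, 0.534, 0.219, 48)
se_exact = slope_standard_error(4.174, 0.534, 0.219, 48, exact=True)
```

For a sample of this size the correction raises the standard error only slightly, which is consistent with the remark that the distribution of the slope approaches normality rapidly.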


The estimation of ρ in normal samples

26.14 The sampling theory of the bivariate normal distribution was developed in
16.23-36: we may now discuss, in particular, the problem of estimating $\rho$ from a
sample, in the light of our results in the theory of estimation (Chapters 17-18).

In 16.24 we saw in effect that the Likelihood Function is given by (16.52), which
contains the observations only in the form of the five statistics $\bar{x}$, $\bar{y}$, $s_1^2$, $s_2^2$, $r$. These
are therefore a set of sufficient statistics for the five parameters $\mu_1$, $\mu_2$, $\sigma_1^2$, $\sigma_2^2$, $\rho$, and
their distribution is complete by 23.10. Further, (16.52) makes it clear that even if
all four other parameters are known, we still require this five-component sufficient
statistic for $\rho$ alone.

In Chapter 18 we saw that the Maximum Likelihood estimator of $\rho$ takes a different
form according to which, if any, other parameters are being simultaneously estimated:
the ML estimator is always a function of the set of sufficient statistics, but it is a different
function in different situations. When $\rho$ alone is being estimated, the ML estimator
is the root of a cubic equation (Example 18.3); when all five parameters are being
estimated, the ML estimator is the sample correlation coefficient r (Example 18.14).
In practice, the latter is by far the most common case, and we therefore now consider
the estimation of $\rho$ by r or functions of it.


26.15 The exact distribution of r, which depends only upon $\rho$, is given by (16.61) or,
more conveniently, by (16.66). Its mean value is given by the hypergeometric series
(16.73). Expanding the gamma functions in (16.73) by Stirling's series (3.64) and
taking the two leading terms of the hypergeometric function, we find

$$E(r) = \rho\Bigl\{1 - \frac{1-\rho^2}{2n} + O(n^{-2})\Bigr\}. \tag{26.31}$$

Thus r is a slightly biassed estimator of $\rho$ when $0 \ne \rho^2 \ne 1$. The bias is generally
small, but it is interesting to inquire whether it can be removed.


26.16 We may approach the problem in two ways. First, we may ask: is there
a function g(r) such that

$$E\{g(r)\} = g(\rho) \tag{26.32}$$

holds identically in $\rho$? Hotelling (1953) showed that if g is not dependent on n, g(r)
could only be a linear function of arcsin r, and Harley (1956-7) showed that in fact

$$E(\arcsin r) = \arcsin\rho, \tag{26.33}$$

a simple proof of Harley's result being given by Daniels and Kendall (1958).


26.17 The second, more direct, approach, is to seek a function of r unbiassed for
$\rho$ itself. By Hotelling's result in 26.16, this function must involve n. Since r is a
function of a set of complete sufficient statistics, the unbiassed function of r must be
unique (cf. 23.9). Olkin and Pratt (1958) found the unbiassed estimator of $\rho$, say
$r_u$, to be the hypergeometric function

$$r_u = r\,F[\tfrac{1}{2}, \tfrac{1}{2}, \tfrac{1}{2}(n-2), (1-r^2)] \tag{26.34}$$

which, expanded into series, gives

$$r_u = r\Bigl\{1 + \frac{1-r^2}{2(n-2)} + \frac{9(1-r^2)^2}{8n(n-2)} + O(n^{-3})\Bigr\}. \tag{26.35}$$

No term in the series is negative, so that

$$|r_u| \ge |r|,$$

the equality holding only if $r^2 = 0$ or 1. Nevertheless, since $F(\tfrac{1}{2}, \tfrac{1}{2}, \tfrac{1}{2}(n-2), 0) = 1$
and $r_u$ is an increasing function of r, we have $r^2 \le r_u^2 \le 1$.
Evidently, the first correction term in (26.35) is counteracting the downward bias
of the term in 1/n in (26.31). Olkin and Pratt recommend the use of the two-term
expansion

$$r_u \doteq r\Bigl\{1 + \frac{1-r^2}{2(n-4)}\Bigr\}. \tag{26.36}$$

The term in braces in (26.36) gives $r_u/r$ within 0.01 for $n \ge 8$ and within 0.001 for
$n \ge 18$, uniformly in r.

Olkin and Pratt give exact tables of $r_u$ for n = 2(2)30 and |r| = 0(0.1)1 which
show that for n > 14, $|r_u|$ never exceeds $|r|$ by more than 5 per cent.
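The estimator (26.34) can be evaluated by summing the hypergeometric series term by term. The following sketch is illustrative only and is not Olkin and Pratt's own computing scheme: the function names, the stopping tolerance and the restriction to n > 4 (which keeps the series convergent at r = 0) are assumptions introduced here, and the two-term form is taken, also as an assumption, to be r{1 + (1 − r²)/2(n − 4)}.

```python
def hyp2f1_half_half(c, x, tol=1e-12, max_terms=100000):
    """Sum the series F(1/2, 1/2; c; x) for 0 <= x <= 1 and c > 1.

    Successive terms satisfy t_{k+1} = t_k (k + 1/2)^2 x / {(c + k)(k + 1)}.
    The loop stops when a term is negligible relative to the running total,
    which is a pragmatic rule rather than a rigorous error bound.
    """
    term, total = 1.0, 1.0
    for k in range(max_terms):
        term *= (0.5 + k) ** 2 * x / ((c + k) * (k + 1))
        total += term
        if term <= tol * total:
            break
    return total


def olkin_pratt(r, n):
    """The estimator (26.34): r F[1/2, 1/2, (n - 2)/2, 1 - r^2], for n > 4."""
    if n <= 4:
        raise ValueError("this sketch assumes n > 4")
    return r * hyp2f1_half_half(0.5 * (n - 2), 1.0 - r * r)


def olkin_pratt_two_term(r, n):
    """Two-term approximation, assumed here to be r{1 + (1 - r^2)/(2(n - 4))}."""
    return r * (1.0 + (1.0 - r * r) / (2.0 * (n - 4)))


# Illustrative values only.
r_u = olkin_pratt(0.5, 20)
r_u_approx = olkin_pratt_two_term(0.5, 20)
```

For r = 0.5 and n = 20 the corrected value lies a little above r, and the series and the two-term form differ only in the fourth decimal place.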

Finally, we note that as n appears only in the denominators of the hypergeometric
series, $r_u \to r$ as $n \to \infty$, so that it has the same limiting distribution as r, namely
a normal distribution with mean $\rho$ and variance $(1-\rho^2)^2/n$.


Confidence limits and tests for ρ

26.18 For testing that $\rho = 0$, the tests based on r are UMPU (cf. Exercise 31.21);
this is not so when we are testing a non-zero value of $\rho$. However, if we confine
ourselves to test statistics which are invariant under changes in location and scale,
one-sided tests based on r are UMP invariant, as Lehmann (1959) shows.

For interval estimation purposes, we may use F. N. David's (1938) charts, described in
20.21. By the duality between confidence intervals and tests remarked in 23.26, these
charts may also be used to read off the values of $\rho$ to be rejected by a size-$\alpha$ test, i.e.
all values of $\rho$ not covered by the confidence interval for that $\alpha$. F. N. David (1937) has
shown that this test is slightly biassed (and the confidence intervals correspondingly
so). This may most easily be seen from the standpoint of the z-transformation
standard-error test given in 26.13: if the latter were exact, and z were exactly normal with
variance independent of $\rho$, the test of $\rho$ would simply be a test of the value of the mean
of a normal distribution with known variance, and we know from Example 23.11 that
if we use an "equal-tails" test for this hypothesis, it is unbiassed. Since z is a one-to-one
function of r, the "equal-tails" test on r would then also be unbiassed. Thus the
slight bias in the "equal-tails" r-test may be regarded as a reflection of the approximate
nature of the z-transformation.

Exercise 26.15 shows that the LR test is based on r, but is not "equal-tails" except
when testing $\rho = 0$.


26.19 Alternatively, we may make an approximate test using Fisher's z-transformation,
the simplest results for which are given in 26.13: to test a hypothetical
value of $\rho$ we compute (26.25) and test that it is normally distributed about (26.26) with
variance (26.27). A one- or two-tailed test is appropriate according to whether the
alternative to this simple hypothesis is one- or two-sided.

In the same way, we may use the z-transformation to test the composite hypothesis
that the correlation parameters of two independently sampled bivariate normal populations
are the same. For if so, the two transformed statistics $z_1$, $z_2$ will each be distributed
as in 26.13, and $(z_1 - z_2)$ will have zero mean and variance $1/(n_1-3) + 1/(n_2-3)$,
where $n_1$ and $n_2$ are the sample sizes. Exercises 26.19-21 show that $(z_1 - z_2)$ is exactly
the Likelihood Ratio statistic when $n_1 = n_2$, and approximately so when $n_1 \ne n_2$. In
either case, however, the test is approximate, being a standard-error test.
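The two-sample comparison just described can be sketched in a few lines. This is an illustrative outline under the normal approximation of 26.13, not a procedure quoted from the text: the function name is hypothetical, the sample correlations and sizes are invented, and the two-sided tail area is obtained from the complementary error function.

```python
import math


def compare_correlations(r1, n1, r2, n2):
    """Approximate test that two independent samples share one correlation parameter.

    Returns the standardized difference (z1 - z2)/sqrt{1/(n1-3) + 1/(n2-3)}
    and its two-sided normal tail area.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)  # atanh(r) = (1/2) log{(1+r)/(1-r)}
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    deviate = (z1 - z2) / standard_error
    two_sided_p = math.erfc(abs(deviate) / math.sqrt(2.0))
    return deviate, two_sided_p


# Hypothetical samples: r1 = 0.5 and r2 = 0.3, each from 103 observations.
deviate, p_value = compare_correlations(0.5, 103, 0.3, 103)
```

With these invented figures the standardized difference is about 1.7, so the hypothesis of equal correlations would not be rejected by a two-sided test at the 5 per cent level.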

The more general composite hypothesis, that the two correlation parameters $\rho_1$, $\rho_2$
differ by an amount $\Delta$, cannot be tested in this way. For then

$$E(z_1 - z_2) = \tfrac{1}{2}\log\Bigl(\frac{1+\rho_1}{1-\rho_1}\Bigr) - \tfrac{1}{2}\log\Bigl(\frac{1+\rho_2}{1-\rho_2}\Bigr) = \tfrac{1}{2}\log\Bigl\{\Bigl(\frac{1+\rho_1}{1-\rho_1}\Bigr)\Bigl(\frac{1-\rho_2}{1+\rho_2}\Bigr)\Bigr\}$$

is not a function of $|\rho_1 - \rho_2|$ alone. The z-transformation could be used to test

$$H_0: \frac{1+\rho_1}{1-\rho_1} = a\Bigl(\frac{1+\rho_2}{1-\rho_2}\Bigr)$$

for any constant a, but this is not a hypothesis of interest. Except in the very large
sample case, when we may use standard errors, there seems to be no way of testing
$H_0: \rho_1 - \rho_2 = \Delta$, for the exact distribution of the difference $r_1 - r_2$ has not been
investigated.


Tests of independence and regression tests 

26.20 In the particular case when we wish to test $\rho = 0$, i.e. the independence of
the normal variables, we may use the exact result of 16.28, that

$$t = \{(n-2)r^2/(1-r^2)\}^{1/2} \tag{26.37}$$

is distributed in "Student's" t-distribution with (n−2) degrees of freedom. $r^2$ is
essentially the LR test statistic for $\rho = 0$ (cf. Exercise 26.15) and this is equivalent
to an "equal-tails" test on r.

Essentially, we are testing here that $\mu_{11}$ of the population is zero, and clearly this
implies that the population regression coefficients $\beta_1$, $\beta_2$ are zero. Now in 16.36, we
showed that

$$t = (b_2 - \beta_2)\Bigl\{\frac{s_1^2(n-2)}{s_2^2(1-r^2)}\Bigr\}^{1/2} \tag{26.38}$$

has the "Student's" distribution with (n−2) degrees of freedom. Thus $E(b_2) = \beta_2$.
When $\beta_2 = 0$, (26.38) is seen to be identical with (26.37). Thus the test of independence
may be regarded as a test that a regression coefficient is zero, a special case of
the general test (26.38) which we use for hypotheses concerning $\beta_2$. It will be noted
that the exact test for any hypothetical value of $\beta_2$ is of a much simpler form than that
for $\rho$.
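The identity between (26.37) and (26.38) when the hypothetical slope is zero is easily verified numerically. In this hedged sketch the function names are hypothetical, the statistic from r is given the sign of r (an assumption, since (26.37) as printed is a positive square root), and the inputs are the rounded moments of Example 26.7.

```python
import math


def t_from_r(r, n):
    """Signed form of (26.37): r {(n - 2)/(1 - r^2)}^(1/2), with n - 2 d.f."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def t_from_slope(b2, beta2, s1_sq, s2_sq, r, n):
    """The statistic (26.38) for a hypothetical slope beta2 of y on x."""
    return (b2 - beta2) * math.sqrt(s1_sq * (n - 2) / (s2_sq * (1.0 - r * r)))


# Rounded sample moments from Example 26.7, used for illustration.
n, s1_sq, s2_sq, m11 = 48, 4.174, 0.534, 0.327
r = m11 / math.sqrt(s1_sq * s2_sq)
b2 = m11 / s1_sq

t_r = t_from_r(r, n)
t_b = t_from_slope(b2, 0.0, s1_sq, s2_sq, r, n)
```

Both routes give the same value, to be referred to "Student's" distribution with 46 degrees of freedom.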

We shall see in Chapter 31 that tests of independence can be made without 
any assumption of normality in the parent distribution. 


Correlation ratios and linearity of regression 

26.21 In 26.8 we discussed the fitting of approximate regression lines in cases
where regressions are not exactly linear. We can make further progress in analysing
the linearity of regression. Consider first the case of exact linear regression of x on y,
when (26.12) holds. Squaring (26.12) and taking expectations with respect to y, we
have

$$\operatorname{var}\{E(x|y)\} = E[\{E(x|y) - E(x)\}^2] = \beta_1^2\sigma_2^2 = \mu_{11}^2/\sigma_2^2. \tag{26.39}$$

We now define the correlation ratio of x on y, $\eta_1$, by

$$\eta_1^2 = \operatorname{var}\{E(x|y)\}/\sigma_1^2, \tag{26.40}$$

the ratio of the variance of x-array means to the variance of x. Unlike $\rho$, $\eta_1^2$ is evidently
not symmetric in x and y. (26.40) implies that $\eta_1^2$ is invariant under permutation of
the x-arrays, since $\operatorname{var}\{E(x|y)\}$ does not depend on the order in which the values of
$E(x|y)$ occur. This is in sharp contrast to $\rho$, which is sensitive to any change in the
order of arrays.
From (26.39) we see that if the regression is exactly linear

$$\eta_1^2 = \mu_{11}^2/(\sigma_1^2\sigma_2^2) = \rho^2.$$

Now consider the general case where the regression is not necessarily exactly linear.
We have

$$\begin{aligned}
\sigma_1^2 = E[\{x - E(x)\}^2] &= E[(\{x - E(x|y)\} + \{E(x|y) - E(x)\})^2]\\
&= E[\{x - E(x|y)\}^2] + E[\{E(x|y) - E(x)\}^2]\\
&= E[\{x - E(x|y)\}^2] + \operatorname{var}\{E(x|y)\},
\end{aligned} \tag{26.41}$$

since the cross-product term

$$2E[\{x - E(x|y)\}\{E(x|y) - E(x)\}] = 2\mathop{E}_{y}\bigl[\{E(x|y) - E(x)\}\mathop{E}_{x|y}\{x - E(x|y)\}\bigr] = 0.$$

Thus, from (26.40) and (26.41), we have

$$0 \le \eta_1^2 \le 1, \tag{26.42}$$

and $\eta_1^2 = 1$ if and only if $E[\{x - E(x|y)\}^2] = 0$, i.e. every observation lies on the
regression curve, so that x and y are strictly functionally related. Further,

$$\rho^2/\eta_1^2 = \mu_{11}^2/[\sigma_2^2 \operatorname{var}\{E(x|y)\}]. \tag{26.43}$$

By the Cauchy-Schwarz inequality,

$$\begin{aligned}
\mu_{11}^2 &= [E\{(x - E(x))(y - E(y))\}]^2 = \bigl[\mathop{E}_{y}\{(y - E(y))(E(x|y) - E(x))\}\bigr]^2\\
&\le E[\{y - E(y)\}^2]\,E[\{E(x|y) - E(x)\}^2] = \sigma_2^2 \operatorname{var}\{E(x|y)\},
\end{aligned} \tag{26.44}$$

the equality holding if and only if $\{y - E(y)\}$ is proportional to $\{E(x|y) - E(x)\}$,
i.e. if $E(x|y)$ is a strict linear function of y. Thus, from (26.43) and (26.44),

$$\rho^2/\eta_1^2 \le 1, \tag{26.45}$$

the equality holding only when the regression of x on y is exactly linear. Hence,
from (26.42), (26.45) and (26.18), we finally have the inequalities

$$0 \le \rho^2 \le \eta_1^2 \le 1. \tag{26.46}$$

We may summarize our results on the attainment of the inequalities in (26.46),
given in 26.9-10 and in this section, as follows:
(a) $\rho^2 = 0$ if, but not only if, x and y are independent;
(b) $\rho^2 = \eta_1^2 = 1$ if, and only if, x and y are in strict linear functional relationship;
(c) $\rho^2 < \eta_1^2 = 1$ if, and only if, x and y are in strict non-linear functional
relationship;
(d) $\rho^2 = \eta_1^2 < 1$ if, and only if, the regression of x on y is exactly linear, but there
is no functional relationship;
(e) $\rho^2 < \eta_1^2 < 1$ implies that there is no functional relationship, and some
non-linear regression curve is a better "fit" than the "best" straight line, for
(26.20) and (26.40) then imply that $\operatorname{var}\{E(x|y)\} > \operatorname{var}(\alpha_1 + \beta_1 y)$, so that
the array means are more dispersed than in the straight-line regression most
nearly "fitting" them. (Of course, there may be no better-fitting sample
regression curve.)
Since $\eta_1^2$ takes no account of the order of the x-arrays, it does not measure any
particular type of dependence of x on y, but the value of $\eta_1^2 - \rho^2$ is an indicator of
non-linearity of regression: it is important to remember that it is an indicator, not a measure,
and in order to assess its importance the number of arrays (and the number of observations)
must also be taken into account. We discuss this matter in 26.24.


26.22 Similarly, we define, for the regression of y on x, the correlation ratio

$$\eta_2^2 = \operatorname{var}\{E(y|x)\}/\sigma_2^2, \tag{26.47}$$

and again

$$0 \le \rho^2 \le \eta_2^2 \le 1.$$

Since $\eta_1^2 = 1$ if and only if there is a strict functional relationship, $\eta_1^2 = 1$ implies
$\eta_2^2 = 1$ and conversely. In general, both squared correlation ratios exceed $\rho^2$, but we
shall have $\eta_1^2 = \rho^2 < \eta_2^2$ if the regression of x on y is linear while that of y on x is not,
as in the following Example.


Example 26.8

Consider again the situation in Example 26.3. The regression of x on $y^2$ was linear
with regression coefficient 0, so that the correlation between x and $y^2$ is zero also. Since
we found $E(x|y^2) = 0$, it follows that $\operatorname{var}\{E(x|y^2)\} = 0$ also, so that the correlation
ratio of x on $y^2$ is 0, as it must be, since the correlation coefficient is zero and the
regression linear.

The regression of $y^2$ on x was not linear: we found in Example 26.3 that

$$E(y^2|x) = 1 + \rho^2(x^2 - 1),$$

so that

$$\operatorname{var}\{E(y^2|x)\} = E[\{\rho^2(x^2-1)\}^2] = \rho^4 E[\{x^2-1\}^2] = 2\rho^4,$$

and $\operatorname{var}(y^2) = 2$, so that the correlation ratio of $y^2$ on x is $\rho^4$, which always exceeds the
correlation coefficient between x and $y^2$, which is zero, when $\rho \ne 0$.


When correlation ratios are being calculated from sample data, we use the observed
variance of array means and the observed variance in (26.40) and (26.41), properly
weighted, obtaining for the observed correlation ratio of x on y

$$e_1^2 = \frac{\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2}{\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x})^2} = \frac{\sum_i n_i\bar{x}_i^2 - n\bar{x}^2}{\sum_i\sum_j x_{ij}^2 - n\bar{x}^2}, \tag{26.48}$$

where $\bar{x}_i$ is the mean of the ith x-array, and $n_i$ the number of observations in the array,
there being k arrays. A similar expression holds for $e_2^2$, the observed correlation ratio
of y on x. As for populations,

$$0 \le r^2 \le e_i^2 \le 1, \qquad i = 1, 2. \tag{26.49}$$


Example 26.9. Computation of the correlation ratio 


Let us calculate the correlation ratio of y on x for the data of Table 26.1, which
we now treat as a sample. The computation is set out in Table 26.4.

Table 26.4

Stature   Mean weight in
  (x)     array ($\bar{y}_i$)    $\bar{y}_i^2$      $n_i$     $n_i\bar{y}_i^2$

   54         92.50         8,556.25        5         42,781.25
   56        111.41        12,412.19       33        409,602.27
   58        122.05        14,896.20      254      3,783,634.80
   60        124.43        15,482.82      813     12,587,532.66
   62        130.22        16,957.25     1340     22,722,715.00
   64        134.59        18,114.47     1454     26,338,439.38
   66        140.48        19,734.63      750     14,800,972.50
   68        146.37        21,424.18      275      5,891,649.50
   70        157.32        24,749.58       56      1,385,976.48
   72        163.41        26,702.83       11        293,731.13
   74        179.50        32,220.25        4        128,881.00

                                      n = 4995     88,385,915.97


In Example 26.6 we found the mean of y to be
$\bar{y} = 132.82$
and the variance of y to be 507.46. Thus, from (26.48), the correlation ratio of y
on x is

$$e_2^2 = \frac{88{,}385{,}915.97 - 4{,}995\,(132.82)^2}{4{,}995 \times 507.46} = \frac{88{,}385{,}915.97 - 88{,}117{,}556.24}{2{,}534{,}762.70} = \frac{268{,}359.73}{2{,}534{,}762.70} = 0.106.$$

This is only slightly greater than the squared correlation coefficient
$r^2 = (0.322)^2 = 0.104$.
Fig. 26.1 shows that the linear approximation CC' to the regression is indeed rather
good.
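The computation of Example 26.9 can be reproduced directly from the array means and frequencies of Table 26.4. The sketch below is illustrative: the function name and data layout are hypothetical, and the overall mean (132.82) and variance (507.46) are taken as given from Example 26.6 rather than recomputed.

```python
# (stature x, mean weight in array, number of observations), from Table 26.4.
ARRAYS = [
    (54, 92.50, 5), (56, 111.41, 33), (58, 122.05, 254), (60, 124.43, 813),
    (62, 130.22, 1340), (64, 134.59, 1454), (66, 140.48, 750), (68, 146.37, 275),
    (70, 157.32, 56), (72, 163.41, 11), (74, 179.50, 4),
]


def correlation_ratio(arrays, grand_mean, variance):
    """Observed correlation ratio as in (26.48), from array means and frequencies.

    Numerator: sum of n_i * mean_i^2 minus n * grand_mean^2.
    Denominator: n times the overall variance (divisor n).
    """
    n = sum(count for _, _, count in arrays)
    between = sum(count * mean * mean for _, mean, count in arrays) - n * grand_mean ** 2
    return between / (n * variance)


total_n = sum(count for _, _, count in ARRAYS)
e2_squared = correlation_ratio(ARRAYS, 132.82, 507.46)
```

The result agrees with the value 0.106 found in the example; small differences in later decimal places arise because the table rounds each squared array mean to two decimals.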


Testing correlation ratios and linearity of regression 


26.23 We saw in 26.21 that $\eta_1^2 = \rho^2$ indicates that no better regression curve than
a straight line can be found, and hence that a positive value of $\eta_1^2 - \rho^2$ is an indicator
of non-linearity of regression. Now that we have defined the sample correlation
ratios, $e_i^2$, it is natural to ask whether the statistic $(e_1^2 - r^2)$ will provide a test of the
linearity of regression of x on y. In the following discussion, we take the opportunity
to give also a test for the hypothesis that $\eta_1^2 = 0$ and also to bring these tests into
relation with the test of $\rho = 0$ given at (26.37). These problems were first solved by
R. A. Fisher.

The identity

$$ns_1^2 = ns_1^2 r^2 + ns_1^2(e_1^2 - r^2) + ns_1^2(1 - e_1^2), \tag{26.50}$$

has all terms on the right non-negative, by (26.49). Since

$$\sum_{i=1}^{k}\sum_{j=1}^{n_i}\{(\bar{x}_i - \bar{x}) - b_1(y_i - \bar{y})\}^2 = \sum_i\sum_j(\bar{x}_i - \bar{x})^2 - r^2\sum_i\sum_j(x_{ij} - \bar{x})^2,$$

(26.50) may be rewritten as

$$\sum_i\sum_j(x_{ij} - \bar{x})^2 = r^2\sum_i\sum_j(x_{ij} - \bar{x})^2 + \sum_i\sum_j\{(\bar{x}_i - \bar{x}) - b_1(y_i - \bar{y})\}^2 + \sum_i\sum_j(x_{ij} - \bar{x}_i)^2. \tag{26.51}$$


Now (26.51) is a decomposition of a quadratic form in the $x_{ij}$ into three other such
forms. We now assume that the $y_i$ are fixed and that all the $x_{ij}$ are normally distributed,
independently of each other, with the same variance (taken to be unity without
loss of generality). We leave open for the moment the question of the means of the $x_{ij}$.

On the hypothesis $H_0$ that every $x_{ij}$ has the same mean, i.e. that the regression
curve is a line parallel to the y-axis, we know that the left-hand side of (26.51) is
distributed in the chi-squared form with (n−1) degrees of freedom. It is a straightforward,
though tedious, task to show that the quadratic forms on the right of (26.51)
have ranks 1, (k−2) and (n−k) respectively. Since these add to (n−1), it follows
from Cochran's theorem (15.16) that the three terms on the right are independently
distributed in the chi-squared form with degrees of freedom equal to their ranks. By
16.15, it follows that the ratio of any two of them (divided by their ranks) has an F
distribution, with the appropriate degrees of freedom. We may use this fact in two
ways to test $H_0$:

(a) The ratio of the first to the sum of the second and third terms, divided by their
ranks,

$$\frac{r^2/1}{(1-r^2)/(n-2)} \quad \text{is } F_{1,\,n-2}, \tag{26.52}$$

suffixes denoting degrees of freedom.

This, it will be seen, is identical with the test of (26.37), since $t_{n-2}^2 = F_{1,\,n-2}$ by
16.15. We derived it at 16.28 for a bivariate normal population. Here we are taking
the y's as fixed and the distribution within each x-array as normal.

(b) The ratio of the sum of the first and second terms to the third, divided by their
ranks,

$$\frac{e_1^2/(k-1)}{(1-e_1^2)/(n-k)} \quad \text{is } F_{k-1,\,n-k}. \tag{26.53}$$

For both tests, large values of the test statistic lead to the rejection of $H_0$.
The tests based on (26.52) and (26.53) are quite distinct and are both valid
tests of $H_0$, but (26.52) essentially tests $\rho^2 = 0$ while (26.53) tests $\eta_1^2 = 0$. If the
alternative hypothesis is that the regression of x on y is linear, the test (26.52) will have
higher power; but if the alternative is that the regression may be of any form other
than that specified by $H_0$, (26.53) is evidently more powerful. It is almost universal
practice to use (26.52) in the form of a linear regression test (26.20), but there certainly
are situations to which (26.53) is more appropriate. We discuss the tests further in
26.24, but first discuss the test of linearity of regression.


26.24 If the $x_{ij}$ do not all have the same mean, the left-hand side of (26.51) is
no longer a $\chi_{n-1}^2$. However, if we take the first term on the right over to the left,
we get

$$ns_1^2(1-r^2) = ns_1^2(e_1^2 - r^2) + ns_1^2(1 - e_1^2). \tag{26.54}$$

Since

$$ns_1^2(1-r^2) = \sum_{i=1}^{k}\sum_{j=1}^{n_i}\{x_{ij} - (a_1 + b_1 y_i)\}^2,$$

the sum of squared residuals from the fitted linear regression, we see that on the
hypothesis $H_0'$ that the regression of x on y is exactly linear, and distributions within arrays
are normal as before, $ns_1^2(1-r^2)$ is distributed in the chi-squared form with (n−2)
degrees of freedom, one degree of freedom being lost for each parameter fitted in the
regression line (cf. 19.9). The ranks of the quadratic forms on the right of (26.54)
are (k−2) and (n−k) as before, and they are therefore independently distributed in
the chi-squared form with those degrees of freedom. Hence their ratio, after division
by their ranks,

$$\frac{(e_1^2 - r^2)/(k-2)}{(1-e_1^2)/(n-k)} \quad \text{is } F_{k-2,\,n-k}. \tag{26.55}$$

(26.55) may be used to test $H_0'$, the hypothesis of linearity of regression, $H_0'$ being
rejected for large values of the test statistic. Again, we have made no assumption about
the Nije 

Thus our intuitive notion that $(e_1^2 - r^2)$ must afford a test of linearity of regression
is correct, but (26.55) shows that the test result will be a function of $(1 - e_1^2)$, k and n,
so that a value of $(e_1^2 - r^2)$ alone means little.
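The three variance-ratio statistics (26.52), (26.53) and (26.55) depend only on the squared correlation coefficient, the squared correlation ratio, n and k, so they are easily computed together. The sketch below is illustrative: the function name and dictionary keys are hypothetical, and the inputs are the rounded values of Example 26.9 (which concern the regression of y on x, to which the analogous statistics apply), with n = 4995 observations in k = 11 arrays.

```python
def regression_f_statistics(r_sq, e_sq, n, k):
    """Variance ratios for the tests of 26.23-26.24.

    slope        (26.52): F with (1, n - 2) degrees of freedom
    array_means  (26.53): F with (k - 1, n - k) degrees of freedom
    linearity    (26.55): F with (k - 2, n - k) degrees of freedom
    """
    within = (1.0 - e_sq) / (n - k)
    return {
        "slope": r_sq / ((1.0 - r_sq) / (n - 2)),
        "array_means": (e_sq / (k - 1)) / within,
        "linearity": ((e_sq - r_sq) / (k - 2)) / within,
    }


# Rounded values from Example 26.9, used for illustration.
stats = regression_f_statistics(0.104, 0.106, 4995, 11)
```

The linearity ratio is very sensitive to rounding in the two squared coefficients, since it depends on their small difference; in practice it should be computed from unrounded values.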

All three tests which we have discussed in this and the last section are LR tests of
linear hypotheses, of the type discussed in the second part of Chapter 24. For example,
the hypothesis that all the variables $x_{ij}$ have the same mean may be regarded in two
ways: we may regard them as lying on a straight line which has two parameters, and
test the hypothesis that the line has zero slope, which imposes one constraint on the
two parameters. In the notation of 24.27-8, k = 2 and r = 1, so that we get an
F-test with (1, n−2) degrees of freedom: this is (26.52). Alternatively, we may consider
that the k array means are on a k-parameter curve (a polynomial of degree (k−1), say),
and test the hypothesis that all the polynomial's coefficients except the constant are
zero, imposing (k−1) constraints. We then get an F-test with (k−1, n−k) degrees
of freedom: this is (26.53). Finally, if in this second formulation we test the
hypothesis that all the polynomial coefficients except the constant and the linear one are
zero, so that the array means lie on a straight line, we impose (k−2) constraints and
get an F-test with (k−2, n−k) degrees of freedom: this is (26.55).

It follows that for fixed values of $y_i$ the results of Chapter 24 concerning the power
of the LR test, based on the non-central F-distribution, are applicable to these tests,
which are UMP invariant tests by 24.37. However, the distributions in the bivariate
normal case, which allow the $y_i$ to vary, will not coincide with those derived by holding
the $y_i$ fixed as above, except when the hypothesis tested is true, when the variation of
the $y_i$ is irrelevant (as we shall see in 27.29). For example, the distribution of $r^2$
obtained from the non-central F-distribution for (26.52) does not coincide with the
bivariate normal result obtainable from (16.61) or (16.66). The power functions of
the test of $\rho = 0$ are therefore different in the two cases, even though the same test
is valid in each case. For large n, however, the results do coincide: we discuss this
more generally in connexion with the multiple correlation coefficient (of which $r^2$ is
a special case) in 27.29 and 27.31.


Intra-class correlation


26.25 ‘There sometimes occur, mainly in biological work, cases in which we require 
the correlation between members of one or more families. We might, for example, 
wish to examine the correlation between heights of brothers. The question then 
arises, which is the first variate and which the second? In the simplest case we might 
have a number of families each containing two brothers. Our correlation table has 
two variates, both height, but in order to complete it we must decide which brother 
is to be related to which variate. One way of doing so would be to take the elder 
brother first, or the taller brother; but this would provide us with the correlation 
between elder and younger brothers, or between taller and shorter brothers, and not 
the correlation between brothers in general, which is what we require. 

The problem is met by entering in the correlation table both possible pairs, i.e.
those obtained by taking each brother first. If the family, or, more generally, the class,
contains k members, there will be k(k−1) entries, each member being taken first in
association with each other member second. If there are p classes with $k_1, k_2, \ldots, k_p$
members there will be $\sum_{i=1}^{p} k_i(k_i - 1) = N$ entries in the correlation table.

As a simple illustration consider five families of three brothers with heights in
inches respectively: 69, 70, 72; 70, 71, 72; 71, 72, 72; 68, 70, 70; 71, 72, 73.
There will be 30 entries in the table, which will be as follows:

Table 26.5

                            Height (inches)
                   68    69    70    71    72    73   TOTALS

            68      -     -     2     -     -     -      2
            69      -     -     1     -     1     -      2
  Height    70      2     1     2     1     2     -      8
 (inches)   71      -     -     1     -     4     1      6
            72      -     1     2     4     2     1     10
            73      -     -     -     1     1     -      2

        TOTALS      2     2     8     6    10     2     30

Here, for example, the pair 69, 70 in the first family is entered as (69, 70) and (70, 69)
and the pair 72, 72 in the third family twice as (72, 72).
The table is symmetrical about its leading diagonal, as it evidently must be. We
may calculate the product-moment correlation coefficient in the usual way. We find

$$\sigma_1^2 = \sigma_2^2 = 1.716, \quad \mu_{11} = 0.516 \quad \text{and hence} \quad \rho = \frac{0.516}{1.716} = 0.301.$$

A correlation coefficient of this kind is called an intra-class correlation coefficient. 
It can be found more directly as follows : 

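Suppose there are $p$ classes with variate-values $x_{11}, \ldots, x_{1k_1}$; $x_{21}, \ldots, x_{2k_2}$; $\ldots$;
$x_{p1}, \ldots, x_{pk_p}$. In the correlation table, each member of the $i$th class will appear $k_i-1$
times (once in association with each other member of its class), and thus the mean of
each variate is given by

$$\mu = \frac{1}{N}\sum_{i=1}^{p}(k_i-1)\sum_{j=1}^{k_i}x_{ij},$$

and the variance of each variate by

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{p}(k_i-1)\sum_{j=1}^{k_i}(x_{ij}-\mu)^2.$$

The covariance is

$$\mu_{11} = \frac{1}{N}\sum_{i=1}^{p}\sum_{j\neq l}(x_{ij}-\mu)(x_{il}-\mu)
= \frac{1}{N}\sum_{i=1}^{p}\Big[\sum_{j}\sum_{l}(x_{ij}-\mu)(x_{il}-\mu)-\sum_{j}(x_{ij}-\mu)^2\Big]$$

$$\phantom{\mu_{11}} = \frac{1}{N}\Big[\sum_{i}k_i^2(\mu_i-\mu)^2-\sum_{i}\sum_{j}(x_{ij}-\mu)^2\Big],$$

where $\mu_i$ is the mean of the $i$th class. Thus we have for the correlation coefficient

$$\rho = \frac{\sum_{i}k_i^2(\mu_i-\mu)^2-\sum_{i}\sum_{j}(x_{ij}-\mu)^2}{\sum_{i}(k_i-1)\sum_{j}(x_{ij}-\mu)^2}. \tag{26.56}$$

If $k_i = k$ for all $i$, (26.56) simplifies to

$$\rho = \frac{k^2p\,\sigma_m^2-kp\,\sigma^2}{(k-1)kp\,\sigma^2} = \frac{1}{k-1}\Big(\frac{k\sigma_m^2}{\sigma^2}-1\Big), \tag{26.57}$$

where $\sigma_m^2$ is the variance of class means, $\frac{1}{p}\sum_{i=1}^{p}(\mu_i-\mu)^2$.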


To distinguish the intra-class coefficient from the ordinary product-moment correlation
coefficient $\rho$, we shall denote it by $\rho_i$ and sample values of it by $r_i$.


Example 26.10 

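Let us use (26.57) to find the intra-class coefficient for the data of Table 26.5.
With a working mean at 70 inches, the values of the variates are −1, 0, 2; 0, 1, 2;
1, 2, 2; −2, 0, 0; 1, 2, 3.

Hence $\mu = 13/15$ and

$$\sigma^2 = \tfrac{1}{15}\{(-1)^2+0^2+\ldots+3^2\}-\big(\tfrac{13}{15}\big)^2 = \tfrac{37}{15}-\tfrac{169}{225} = \tfrac{386}{225}.$$

The means of families, $\mu_i$, are

$$\tfrac{1}{3},\quad 1,\quad \tfrac{5}{3},\quad -\tfrac{2}{3},\quad 2,$$

and their deviations from $\mu$ are

$$-\tfrac{8}{15},\quad \tfrac{2}{15},\quad \tfrac{12}{15},\quad -\tfrac{23}{15},\quad \tfrac{17}{15}.$$

Thus

$$\sigma_m^2 = \tfrac{1}{5}\cdot\tfrac{1}{15^2}\{8^2+2^2+12^2+23^2+17^2\} = \tfrac{1030}{1125}.$$

Hence, from (26.57),

$$\rho_i = \tfrac{1}{2}\Big\{3\cdot\tfrac{1030}{1125}\Big/\tfrac{386}{225}-1\Big\} = 0.301,$$

a result we have already found directly in 26.25.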
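
The two routes to the intra-class coefficient can be checked against each other numerically. The sketch below is illustrative only: the function names are our own, and it simply builds the symmetric table of ordered pairs for the five families, then evaluates the direct formula (26.56) on the same data.

```python
from itertools import permutations

def intraclass_from_pairs(families):
    """Product-moment correlation over the table of all ordered within-class pairs."""
    pairs = [(a, b) for fam in families for a, b in permutations(fam, 2)]
    n = len(pairs)
    mean = sum(a for a, _ in pairs) / n            # both variates share this mean
    var = sum((a - mean) ** 2 for a, _ in pairs) / n
    cov = sum((a - mean) * (b - mean) for a, b in pairs) / n
    return cov / var

def intraclass_direct(families):
    """Formula (26.56): uses class means and sums of squares, no pair table."""
    N = sum(len(f) * (len(f) - 1) for f in families)
    mu = sum((len(f) - 1) * sum(f) for f in families) / N
    num = den = 0.0
    for f in families:
        k = len(f)
        ss = sum((x - mu) ** 2 for x in f)         # sum of squares about mu
        class_mean = sum(f) / k
        num += k * k * (class_mean - mu) ** 2 - ss
        den += (k - 1) * ss
    return num / den

families = [(69, 70, 72), (70, 71, 72), (71, 72, 72), (68, 70, 70), (71, 72, 73)]
print(round(intraclass_from_pairs(families), 3))   # 0.301
print(round(intraclass_direct(families), 3))       # 0.301
```

Because `intraclass_direct` never forms the pair table, it remains cheap when classes are large, which is the practical point of (26.56).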


26.26 Caution is necessary in the interpretation of the intra-class correlation
coefficient. From (26.57) it is seen that $\rho_i$ cannot be less than $-1/(k-1)$, though it may
attain +1 when $\sigma_m^2 = \sigma^2$. It is thus a skew coefficient in the sense that a negative
value has not the same significance (as a departure from independence) as the equivalent
positive value.

In point of fact, the intra-class coefficient is, from most points of view, more
conveniently considered as (a simple linear transform of) a ratio of variances between classes
and within classes in the Analysis of Variance. Fisher (1921c) derived the distribution
of intra-class $r_i$ from this approach for the case when families are of the same size $k$.
When $k = 2$, he found, as for the product-moment coefficient $r$, that the transformation

$$z = \operatorname{artanh} r_i$$

gives a statistic ($z$) very nearly normally distributed with mean $\zeta = \operatorname{artanh}\rho_i$ and
variance independent of $\rho_i$. For $k > 2$, a more complicated transformation is necessary.
His results are given in Exercise 26.14.


Tetrachoric correlation 

26.27 We now discuss the estimation of $\rho$ in a bivariate normal population when
the data are not given in full detail. We take first of all an extreme case exemplified
by Table 26.6. This is based on the distribution of cows according to age and milk-yield
given in Table 1.24, Exercise 1.4. Suppose that, instead of being given that
table we had only


Table 26.6: Cows by age and milk-yield

                              Age 6 years    Age 3-5
                              and over       years       TOTAL
Yield 8-18 galls.               1078          1407        2485
Yield 19 galls. and over        1546           881        2427
TOTAL                           2624          2288        4912

This is a highly condensed version of the original. Suppose we assume that the
underlying distribution is bivariate normal. How can we estimate $\rho$ from this table? In
general, for a table of this "2 × 2" type with frequencies


          a          b          a+b
          c          d          c+d               (26.58)
         a+c        b+d      a+b+c+d = n


we require to estimate $\rho$. In (26.58) we shall always take $d$ to be a frequency such that
neither of its marginal frequencies contains the median value of the variate.
If this table is derived by a double dichotomy of the bivariate normal distribution

$$f(x,y) \propto \exp\Big\{-\frac{1}{2(1-\rho^2)}\Big(\frac{x^2}{\sigma_1^2}-\frac{2\rho xy}{\sigma_1\sigma_2}+\frac{y^2}{\sigma_2^2}\Big)\Big\} \tag{26.59}$$

we can find $h'$ such that

$$\int_{-\infty}^{h'}\Big\{\int_{-\infty}^{\infty}f(x,y)\,dy\Big\}\,dx = \frac{a+c}{n}.$$

Putting $h = h'/\sigma_1$, we find this is

$$(2\pi)^{-1/2}\int_{-\infty}^{h}\exp(-\tfrac12 x^2)\,dx = \frac{a+c}{n}, \tag{26.60}$$

and thus $h$ is determinable from tables of the univariate normal distribution function.
Likewise there is a $k$ such that

$$(2\pi)^{-1/2}\int_{-\infty}^{k}\exp(-\tfrac12 y^2)\,dy = \frac{a+b}{n}. \tag{26.61}$$

On our convention as to the arrangement of table (26.58), $h$ and $k$ are never negative.
Having fitted univariate normal distributions to the marginal frequencies of the
table in this way, we now require to solve for $\rho$ the equation

$$\frac{d}{n} = \int_{h}^{\infty}\int_{k}^{\infty}\frac{1}{2\pi(1-\rho^2)^{1/2}}\exp\Big\{-\frac{1}{2(1-\rho^2)}(x^2-2\rho xy+y^2)\Big\}\,dx\,dy. \tag{26.62}$$

The integrand in (26.62) is standardized because $h$ and $k$ were standardized deviates.


We expand the integrand in ascending powers of $\rho$. The characteristic function of
the distribution is

$$\phi(t,u) = \exp\{-\tfrac12(t^2+2\rho tu+u^2)\}.$$

Thus, using the bivariate form of the Inversion Theorem (4.17), (26.62) becomes

$$\frac{d}{n} = \int_{h}^{\infty}\int_{k}^{\infty}\Big\{\frac{1}{4\pi^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\phi(t,u)\exp(-itx-iuy)\,dt\,du\Big\}\,dx\,dy$$

$$\phantom{\frac{d}{n}} = \frac{1}{4\pi^2}\int_{h}^{\infty}\int_{k}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\exp\{-\tfrac12(t^2+u^2)-itx-iuy\}\sum_{j=0}^{\infty}\frac{(-\rho)^j(tu)^j}{j!}\,dt\,du\,dx\,dy. \tag{26.63}$$

The coefficient of $(-\rho)^j/j!$ is the product of two integrals, of which the first is

$$I(x,h,t) = \int_{h}^{\infty}\Big\{\frac{1}{2\pi}\int_{-\infty}^{\infty}t^j\exp(-\tfrac12 t^2-itx)\,dt\Big\}\,dx \tag{26.64}$$

and the second is $I(y,k,u)$. Now from 6.18 the integral in braces in (26.64) is equal to

$$(-i)^j H_j(x)\,\alpha(x),$$

where

$$\alpha(x) = (2\pi)^{-1/2}\exp(-\tfrac12 x^2).$$

By (6.21),

$$-\frac{d}{dx}\{H_{j-1}(x)\alpha(x)\} = H_j(x)\alpha(x).$$

Hence the double integral in (26.64) is

$$I(x,h,t) = \Big[-(-i)^j H_{j-1}(x)\alpha(x)\Big]_{h}^{\infty} = (-i)^j H_{j-1}(h)\alpha(h). \tag{26.65}$$

Substituting from (26.65) for $I(x,h,t)$, $I(y,k,u)$ in (26.63), we have the series

$$\frac{d}{n} = \sum_{j=0}^{\infty}\frac{\rho^j}{j!}H_{j-1}(h)H_{j-1}(k)\,\alpha(h)\alpha(k). \tag{26.66}$$

In terms of the tetrachoric functions which were defined at (6.44) for the purpose,

$$\frac{d}{n} = \sum_{j=0}^{\infty}\rho^j\,\tau_j(h)\,\tau_j(k). \tag{26.67}$$


26.28 Formally, (26.67) provides a soluble equation for p, but in practice the 
solution by successive approximation can be very tedious. (The series (26.67) always 
converges, but may do so slowly.) It is simpler to interpolate in tables which have 
been prepared giving the integral d/n in terms of p for various values of h and k (Tables 
for Statisticians and Biometricians, Vol. 2). 

The estimate of $\rho$ derived from a sample of $n$ in this way is known as tetrachoric $r$.
We shall denote it by $r_t$.


Example 26.11 


For the data of Table 26.6 we find the normal deviate corresponding to
2624/4912 = 0.5342 as h = 0.086, and similarly for 2485/4912 = 0.5059 we find
k = 0.015. We have also for d/n the value 881/4912 = 0.1794.

From the tables, we find for varying values of h, k and ρ the following values of d/n:


               ρ = −0.30                          ρ = −0.35
            h = 0     h = 0.1                  h = 0     h = 0.1
k = 0       0.2015    0.1818       k = 0       0.1931    0.1735
k = 0.1     0.1818    0.1638       k = 0.1     0.1735    0.1555

Linear interpolation gives us for h = 0.086, k = 0.015, the result ρ = −0.32
approximately. In the table, we have inverted the order of columns, and taking account
of this gives us an estimate of ρ = +0.32. We therefore write $r_t$ = +0.32. (The
product-moment coefficient for Table 1.24 is r = 0.22.)
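
Where tables are not to hand, equation (26.62) can be solved numerically. The sketch below does not use the tetrachoric series or the published tables. It assumes instead the standard identity that the derivative of the quadrant probability d/n with respect to ρ equals the standardized bivariate normal density at (h, k), integrates that density over ρ by Simpson's rule, and solves for ρ by bisection. All function names are our own.

```python
import math
from statistics import NormalDist

_N = NormalDist()

def bvn_density(h, k, r):
    """Standardized bivariate normal density at (h, k) with correlation r."""
    s = 1.0 - r * r
    return math.exp(-(h * h - 2 * r * h * k + k * k) / (2 * s)) / (2 * math.pi * math.sqrt(s))

def upper_quadrant(h, k, rho, steps=200):
    """P(x > h, y > k): the value at rho = 0 plus the integral of the density over [0, rho]."""
    base = (1 - _N.cdf(h)) * (1 - _N.cdf(k))
    if rho == 0:
        return base
    w = rho / steps                      # signed step, so negative rho works too
    total = bvn_density(h, k, 0.0) + bvn_density(h, k, rho)
    for i in range(1, steps):            # Simpson's rule (steps is even)
        total += (4 if i % 2 else 2) * bvn_density(h, k, i * w)
    return base + total * w / 3

def tetrachoric(a, b, c, d, tol=1e-8):
    """Solve (26.62) for rho, with the table arranged as in (26.58)."""
    n = a + b + c + d
    h = _N.inv_cdf((a + c) / n)          # (26.60)
    k = _N.inv_cdf((a + b) / n)          # (26.61)
    lo, hi = -0.999, 0.999
    while hi - lo > tol:                 # the quadrant probability increases with rho
        mid = (lo + hi) / 2
        if upper_quadrant(h, k, mid) < d / n:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Table 26.6 in the arrangement of (26.58); the sign is reversed afterwards
# because the age columns run from older to younger.
rho = tetrachoric(1078, 1407, 1546, 881)
print(round(-rho, 2))
```

The root found this way should lie close to the interpolated value of Example 26.11; small differences are to be expected, since the example interpolates linearly in a coarse table.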


26.29 Tetrachoric $r_t$ has been used mainly by psychologists, whose material is
often of the 2 × 2 type. Its sampling distribution, and even its standard error, is not
known in any precise form, but Karl Pearson (1913) gave an asymptotic expression
for its standard error. There are, however, simpler methods of calculation based on
nomograms (Hayes (1946); Hamilton (1948); Jenkins (1955)) and tables for the
standard error in approximate form (Guilford and Lyons (1942); Hayes (1943);
Goheen and Kavruck (1948)). It does not seem to be known for what sample size
such standard errors may safely be used.

For a generalization to polychoric estimation in rxc tables, see 33.35 below. 


Biserial correlation 


26.30 Suppose now that we have a (2 × q)-fold table, the dichotomy being according
to some qualitative factor and the other classification either to a numerical variate or
to a qualitative one, which may or may not be ordered.

Table 26.7 will illustrate the type of material under discussion. The data relate
to 1426 criminals classified according to whether they were alcoholic or not and according
to the crime for which they were imprisoned.


Table 26.7: Showing 1426 criminals classified according to alcoholism and
type of crime
(C. Goring's data, quoted by K. Pearson, 1909)

                 Arson   Rape   Violence   Stealing   Coining   Fraud   TOTALS
Alcoholic          50     88      155        379        18        63      753
Non-alcoholic      43     62      110        300        14       144      673
TOTALS             93    150      265        679        32       207     1426


Even though the columns of the table
are not unambiguously ordered (they are shown arranged in order of an association
of the crimes with intelligence, but this ordering is somewhat arbitrary), we may still
derive an estimate of $\rho$ on the assumption that there is an underlying bivariate normal
distribution. For in such a distribution, $\rho^2 = \eta^2$, the regressions both being linear,
and we remarked in 26.21 that $\eta^2$ is invariant under permutation of arrays. We therefore
proceed to estimate $\eta^2$ ($= \rho^2$) as follows.

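Consider each column of Table 26.7 as a y-array, and let $n_p$ be the number of
observations in the $p$th array, $n = \sum n_p$, $\mu_p$ the mean of y in that array, $\mu$ the mean and
$\sigma_y^2$ the variance of y, and $\sigma_p^2$ the variance of y in the $p$th array. We suppose all
measurements in y to be made from the value k which is the point of dichotomy; this involves
no loss of generality, since $\rho^2$ and $\eta^2$ are invariant under a change of origin. Then
the correlation ratio of y on x (cf. (26.40)) is estimated by

$$\eta^2 = \frac{\frac{1}{n}\sum_{p=1}^{q}n_p(\mu_p-\mu)^2}{\sigma_y^2}
= \frac{1}{n}\sum_{p=1}^{q}n_p\Big(\frac{\mu_p}{\sigma_p}\Big)^2\frac{\sigma_p^2}{\sigma_y^2}-\Big(\frac{\mu}{\sigma_y}\Big)^2. \tag{26.68}$$

But for the bivariate normal distribution $\eta^2 = \rho^2$ and (cf. 16.23)

$$\sigma_p^2/\sigma_y^2 = \operatorname{var}(y\,|\,x)/\sigma_y^2 = (1-\rho^2),$$

so we replace $\sigma_p^2/\sigma_y^2$ by $(1-\rho^2)$ in (26.68), obtaining

$$\rho^2 = \frac{1-\rho^2}{n}\sum_{p=1}^{q}n_p\Big(\frac{\mu_p}{\sigma_p}\Big)^2-\Big(\frac{\mu}{\sigma_y}\Big)^2, \tag{26.69}$$

or

$$\rho^2 = \frac{\dfrac{1}{n}\displaystyle\sum_{p=1}^{q}n_p\Big(\frac{\mu_p}{\sigma_p}\Big)^2-\Big(\frac{\mu}{\sigma_y}\Big)^2}{1+\dfrac{1}{n}\displaystyle\sum_{p=1}^{q}n_p\Big(\frac{\mu_p}{\sigma_p}\Big)^2}. \tag{26.70}$$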


This estimator is known as biserial $\eta$ because of the analogy with the correlation
ratio. We shall write it as $r_\eta$ when estimating from a sample, to maintain our
convention about the use of Roman letters for statistics.

The use of the expression (26.70) lies in the fact that the quantities in it can be
estimated from the data. Our assumption that there is an underlying bivariate normal
distribution implies that the quantity according to which dichotomy has been made
(in our example, alcoholism) is capable of representation by a variate which is normally
distributed, and that each y-array is a dichotomy of a univariate normal distribution.
Thus the ratios $(\mu_p/\sigma_p)$ and $(\mu/\sigma_y)$ can be estimated from the tables of the normal
integral. For example, in Table 26.7, the two frequencies "alcoholic" and "non-alcoholic"
are, for arson, 50 and 43. Thus the proportional frequency in the alcoholic
group is 50/93 = 0.5376 and the normal deviate corresponding to this frequency is
seen from the tables to be 0.0944, which is thus an estimate of $|\mu_p/\sigma_p|$ for this array.


Example 26.12 
For the data of Table 26.7, the proportional frequencies, the estimated values of
$|\mu_p/\sigma_p|$ and $|\mu/\sigma_y|$, and the $n_p$ are:


                Arson    Rape    Violence  Stealing  Coining   Fraud    TOTALS
Alcoholic       0.5376   0.5867   0.5849    0.5582    0.5625   0.3043   0.5281
|μ_p/σ_p|       0.0944   0.2190   0.2144    0.1463    0.1573   0.5119   0.0704 = |μ/σ_y|
n_p                 93      150      265       679        32      207     1426 = n


Then from (26.70) we have

$$r_\eta^2 = \frac{\frac{1}{1426}\{93(0.0944)^2+\ldots\}-(0.0704)^2}{1+\frac{1}{1426}\{93(0.0944)^2+\ldots\}} = 0.05456,$$

or

$$|r_\eta| = 0.234,$$

which, on our assumptions, may be taken as estimating the supposed product-moment
correlation coefficient.
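
The arithmetic of (26.70) is easily mechanized. The following is a minimal sketch, assuming the normal deviates are taken from Python's `statistics.NormalDist` rather than from printed tables; the function name `biserial_eta` is our own. Signs of the deviates do not matter, since only their squares enter.

```python
from statistics import NormalDist

def biserial_eta(upper, lower):
    """|r_eta| from (26.70), given the two rows of a (2 x q)-fold table."""
    nd = NormalDist()
    n_p = [a + b for a, b in zip(upper, lower)]          # array totals
    n = sum(n_p)
    # Estimated mu_p / sigma_p in each array and mu / sigma_y for the whole table.
    ratios = [nd.inv_cdf(a / m) for a, m in zip(upper, n_p)]
    overall = nd.inv_cdf(sum(upper) / n)
    s = sum(m * z * z for m, z in zip(n_p, ratios)) / n
    eta_sq = (s - overall ** 2) / (1 + s)
    return eta_sq ** 0.5

alcoholic = [50, 88, 155, 379, 18, 63]
non_alcoholic = [43, 62, 110, 300, 14, 144]
print(round(biserial_eta(alcoholic, non_alcoholic), 3))
```

With the data of Table 26.7 this should agree with the value of Example 26.12 to within the rounding of the tabulated deviates.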
26.31 As for the tetrachoric $r_t$, the sampling distribution of biserial $r_\eta$ is unknown.
An asymptotic expression for its sampling variance was derived by K. Pearson (1917),
but it is not known how large n must be for this to be valid.

Neither $r_t$ nor $r_\eta$ can be expected to estimate $\rho$ very efficiently, since they are based
on so little information about the variables, and it should be (though it has not always
been) remembered that the assumption of underlying bivariate normality is crucial
to both methods. In the absence of the normality assumption, we do not know what
$r_t$ and $r_\eta$ are estimating in general.


26.32 If in the (2 × q)-fold table the q-fold classification, instead of being defined
by an unordered classification as in Table 26.7, is actually given by variate-value, we
may proceed directly to estimate $\rho$ instead of $\eta$. For we may now use the extra
information to estimate the variance $\sigma_x^2$ of this measured variate and its means, $\mu_1$, $\mu_2$, in
the two halves of the dichotomy according to y. Since the regression of x on y is
linear we have (cf. (26.12))

$$E(x\,|\,y)-\mu_x = \rho\,\frac{\sigma_x}{\sigma_y}(y-\mu_y). \tag{26.71}$$

We can, as in 26.27, find k such that

$$1-F(k) = (2\pi)^{-1/2}\int_{k}^{\infty}\exp(-\tfrac12 u^2)\,du = \frac{n_1}{n_1+n_2}, \tag{26.72}$$

where $n_1$ is the total number of individuals bearing one attribute of the y-class ("higher"
values of y) and $n_2$ is the number bearing the other. k is the point of dichotomy of
the normal distribution of y.

From (26.71), the means $(\bar{x}_i, \bar{y}_i)$ ($i = 1, 2$) of each part of the dichotomy will be
on the regression line (26.71). Thus, for the part of the dichotomy with the "higher"
value of y, say $\bar{y}_1$,

$$\rho = \frac{E(x\,|\,\bar{y}_1)-\mu_x}{\sigma_x}\Big/\frac{\bar{y}_1-\mu_y}{\sigma_y}.$$

Thus we may estimate $\rho$ by

$$r_b = \frac{\bar{x}_1-\bar{x}}{s_x}\Big/\frac{\bar{y}_1-\mu_y}{\sigma_y}, \tag{26.73}$$

where $\bar{x}_1$, $\bar{x}$ are the means of x in the "high-y" observations and the whole table
respectively, while $s_x^2$ is the observed variance of x in the whole table. The denominator
of (26.73) is given by

$$\frac{\bar{y}_1-\mu_y}{\sigma_y} = (2\pi)^{-1/2}\int_{k}^{\infty}u\exp(-\tfrac12 u^2)\,du\Big/(2\pi)^{-1/2}\int_{k}^{\infty}\exp(-\tfrac12 u^2)\,du
= (2\pi)^{-1/2}\exp(-\tfrac12 k^2)\Big/\Big(\frac{n_1}{n_1+n_2}\Big)$$

by (26.72).

If, then, we denote the ordinate of the normal distribution at k by $z_k$, we have the
estimator of $\rho$

$$r_b = \frac{\bar{x}_1-\bar{x}}{s_x}\cdot\frac{n_1}{n_1+n_2}\cdot\frac{1}{z_k}. \tag{26.74}$$
We write the estimator based on this equation as $r_b$, the suffix denoting "biserial":
$r_b$ is called "biserial r."

The equation is usually put in a more symmetrical form. Since

$$\bar{x} = (n_1\bar{x}_1+n_2\bar{x}_2)/(n_1+n_2),$$

$\bar{x}_1-\bar{x}$ is equal to $n_2(\bar{x}_1-\bar{x}_2)/(n_1+n_2)$. Writing $p$ for the proportion $n_1/(n_1+n_2)$ and
$q = 1-p$, we have the alternative expression for (26.74)

$$r_b = \frac{\bar{x}_1-\bar{x}_2}{s_x}\cdot\frac{pq}{z_k}. \tag{26.75}$$
Sy Rk 


Example 26.13 (from K. Pearson, 1909) 

Table 26.8 shows the returns for 6156 candidates for the London University Matricu- 
lation Examination for 1908/9. The average ages for the two higher age-groups have 
been estimated. 


Table 26.8

Age of candidate        Passed    Failed    TOTALS
16                        583       563      1146
17                        666       980      1646
18                        525       868      1393
19-21                     383       814      1197
22-30 (mean 25)           214       439       653
over 30 (mean 33)          40        81       121
TOTALS                   2411      3745      6156


Taking the suffix "1" as relating to successful candidates, we have
$\bar{x}_1$ = 18.4280.
For all candidates together
$\bar{x}$ = 18.7660, $s_x^2$ = (3.2550)².
The value of p is 2411/6156 = 0.3917.
(26.72) gives 1 − F(k) = 0.3917, and we find k = 0.275 and $z_k$ = 0.384. Hence,


from (26.74), 
0:3405 0-3917 
‘a> Shee OSE 
The estimated correlation between age and success is small. 
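
The biserial estimator needs only four summary figures, so it is easy to sketch in code. The function below (its name and argument names are our own) evaluates the form with the mean of the "high" group, the overall mean and standard deviation, and the proportion p, finding k and the ordinate $z_k$ from `statistics.NormalDist` rather than from tables. The inputs are the summary values quoted in Example 26.13.

```python
from statistics import NormalDist

def biserial_r(mean_high, mean_all, sd_all, p):
    """Biserial r: ((x1_bar - x_bar) / s_x) * p / z_k, with 1 - F(k) = p."""
    nd = NormalDist()
    k = nd.inv_cdf(1 - p)        # point of dichotomy of the standard normal
    z_k = nd.pdf(k)              # ordinate of the normal curve at k
    return (mean_high - mean_all) / sd_all * p / z_k

# Summary figures as quoted in Example 26.13.
r_b = biserial_r(mean_high=18.4280, mean_all=18.7660, sd_all=3.2550, p=2411 / 6156)
print(round(r_b, 3))
```

The result is a small negative number, in keeping with the remark that the correlation between age and success is small.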

26.33 As for $r_t$ and $r_\eta$, the assumption of underlying normality is crucial to $r_b$.

The distribution of biserial $r_b$ is not known, but Soper (1914) derived the expression
for its standard error in normal samples

$$\operatorname{var} r_b \sim \frac{1}{n}\Big[\rho^4+\rho^2\Big\{\frac{pqk^2}{z_k^2}+(2p-1)\frac{k}{z_k}-\frac{5}{2}\Big\}+\frac{pq}{z_k^2}\Big], \tag{26.76}$$

and showed that (26.76) is generally well approximated by

$$\operatorname{var} r_b \sim \frac{1}{n}\Big[\frac{(pq)^{1/2}}{z_k}-r_b^2\Big]^2.$$

More recently $r_b$ has been extensively studied by Maritz (1953) and by Tate (1955),
who showed that in normal samples it is asymptotically normally distributed with
mean $\rho$ and variance (26.76), and considered the Maximum Likelihood estimation of $\rho$
in biserial data. It appears, as might be expected, that the variance of $r_b$ is least, for
fixed $\rho$, when the dichotomy is at the middle of the dichotomized variate's range (y = 0).
When $\rho = 0$, $r_b$ is an efficient estimator of $\rho$, but when $\rho^2 \to 1$ the efficiency of $r_b$ tends
to zero. Tate also tables Soper's formula (26.76) for var $r_b$. Cf. Exercises 26.10-12.
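
How close the simpler expression comes to Soper's formula can be seen numerically. The sketch below assumes Soper's variance in the form n·var = ρ⁴ + ρ²{pqk²/z² + (2p−1)k/z − 5/2} + pq/z², with p the proportion above the point of dichotomy k and z the normal ordinate at k, and the approximation in the form n·var ≈ {(pq)^½/z − ρ²}². The function names are our own, and ρ stands in for the sample value in the approximation.

```python
from statistics import NormalDist

def _cut(p):
    """Point of dichotomy k with upper-tail proportion p, and the ordinate there."""
    nd = NormalDist()
    k = nd.inv_cdf(1 - p)
    return k, nd.pdf(k)

def soper_variance(rho, p, n):
    """Large-sample variance of biserial r in the assumed form of (26.76)."""
    k, z = _cut(p)
    q = 1 - p
    inner = p * q * k * k / z ** 2 + (2 * p - 1) * k / z - 2.5
    return (rho ** 4 + rho ** 2 * inner + p * q / z ** 2) / n

def soper_approx(rho, p, n):
    """The simpler approximation: ((pq)^(1/2) / z - rho^2)^2 / n."""
    _, z = _cut(p)
    return ((p * (1 - p)) ** 0.5 / z - rho ** 2) ** 2 / n

for rho in (0.0, 0.3, 0.5, 0.7):
    print(rho, soper_variance(rho, 0.3, 1000), soper_approx(rho, 0.3, 1000))
```

At ρ = 0 the two expressions coincide exactly, both reducing to pq/(n z²); elsewhere they differ only slightly for dichotomies that are not extreme.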


Point-biserial correlation 


26.34 This is a convenient place at which to mention another coefficient, the
point-biserial correlation, which we shall denote by $\rho_{pb}$, and by $r_{pb}$ for a sample. Suppose
that the dichotomy according to y is regarded, not as a section of a normal distribution,
but as defined by a variable taking two values only. So far as correlations are
concerned, we can take these values to be 1 and 0. For example, in Table 26.8 it is not
implausible to suppose that success in the examination is a dichotomy of a normal
distribution of ability to pass it. But if the y-dichotomy were according, say, to sex,
this is no longer a reasonable assumption and a different approach is necessary.

Such a situation is, in fact, fundamentally different from the one we have so far
considered, for we are now no longer estimating $\rho$ in a bivariate normal population:
we consider instead the product-moment of a 0-1 variable y and the variable x. If
$P$ is the true proportion of values of y with y = 1, $Q = 1-P$, we have from binomial
distribution theory

$$E(y) = P, \qquad \sigma_y^2 = PQ,$$

and thus, by definition,

$$\rho_{pb} = \frac{\mu_{11}}{\sigma_x\sigma_y} = \frac{E(xy)-P\,E(x)}{\sigma_x(PQ)^{1/2}}.$$

We estimate $E(xy)$ by $m_{11} = \frac{1}{n}\sum_{i=1}^{n_1}x_i = p\bar{x}_1$, $E(x)$ by $\bar{x}$, $\sigma_x$ by $s_x$, and $P$ by $p = n_1/n$,
obtaining

$$r_{pb} = \frac{p\bar{x}_1-p(p\bar{x}_1+q\bar{x}_2)}{s_x(pq)^{1/2}} = \frac{(\bar{x}_1-\bar{x}_2)(pq)^{1/2}}{s_x}. \tag{26.77}$$


26.35 $r_{pb}$ in (26.77) may be compared with the biserial $r_b$ defined at (26.75). We
have

$$\frac{r_{pb}}{r_b} = \frac{z_k}{(pq)^{1/2}}. \tag{26.78}$$

It has been shown by Tate (1953) by a consideration of Mills' ratio (cf. 5.22) that the
expression on the right of (26.78) is $\leq (2/\pi)^{1/2}$ and the values of the coefficients will
thus, in general, be appreciably different.

Tate (1954) shows that $r_{pb}$ is asymptotically normally distributed with mean $\rho_{pb}$
and variance

$$\operatorname{var} r_{pb} \sim \frac{(1-\rho_{pb}^2)^2}{n}\Big(1-\frac{3}{2}\rho_{pb}^2+\frac{\rho_{pb}^2}{4pq}\Big), \tag{26.79}$$

which is a minimum when $p = q = \frac12$.

Apart from the measurement of correlation, it is clear from (26.77) that, in effect,
for a point-biserial situation, we are simply comparing the means of two samples of
a variate x, the y-classification being no more than a labelling of the samples. In fact

$$\frac{r_{pb}^2}{1-r_{pb}^2} = \frac{(\bar{x}_1-\bar{x}_2)^2\,\dfrac{n_1n_2}{n_1+n_2}}{\sum_i(x_{1i}-\bar{x}_1)^2+\sum_i(x_{2i}-\bar{x}_2)^2} = \frac{t^2}{n_1+n_2-2}, \tag{26.80}$$

where t is the usual "Student's" t-test used for comparing the means of two normal
populations with equal variance (cf. Example 23.8). Thus if the distribution of x is
normal for y = 0, 1, the point-biserial coefficient is a simple transformation of the $t^2$
statistic, which may be used to test it.
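
The link between the point-biserial coefficient and the two-sample t statistic is an exact algebraic identity, so it can be confirmed on any data. The sketch below uses two small made-up samples (purely illustrative, not from the text) and our own function names; note that $s_x$ is computed with divisor n, as the identity requires.

```python
import math

def point_biserial(x1, x0):
    """r_pb = (x1_bar - x0_bar) * sqrt(pq) / s_x, with s_x using divisor n."""
    n1, n0 = len(x1), len(x0)
    n = n1 + n0
    m1, m0 = sum(x1) / n1, sum(x0) / n0
    grand = (sum(x1) + sum(x0)) / n
    s_x = math.sqrt(sum((v - grand) ** 2 for v in x1 + x0) / n)
    p = n1 / n
    return (m1 - m0) * math.sqrt(p * (1 - p)) / s_x

def student_t(x1, x0):
    """Two-sample t with pooled variance on n1 + n0 - 2 degrees of freedom."""
    n1, n0 = len(x1), len(x0)
    m1, m0 = sum(x1) / n1, sum(x0) / n0
    within = sum((v - m1) ** 2 for v in x1) + sum((v - m0) ** 2 for v in x0)
    pooled_var = within / (n1 + n0 - 2)
    return (m1 - m0) / math.sqrt(pooled_var * (1 / n1 + 1 / n0))

# Hypothetical samples: x-values for y = 1 and for y = 0.
x1 = [5.1, 6.0, 6.8, 7.2]
x0 = [4.0, 4.9, 5.5, 5.2, 6.1]
r = point_biserial(x1, x0)
t = student_t(x1, x0)
lhs = r * r / (1 - r * r)
rhs = t * t / (len(x1) + len(x0) - 2)
print(lhs, rhs)   # the two sides agree up to rounding error
```

In practice this means a test of $\rho_{pb} = 0$ needs no new distribution theory: one computes t and refers it to Student's distribution.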


26.36 The above account does not exhaust the possible estimators of bivariate 
normal p from data which are classified in a two-way table. In Chapter 33 we shall 
discuss some estimators based on rank-order statistics. 


EXERCISES 


26.1 Show that the correlation coefficient for the data of Table 26.2 is +0.072.
Show that the regression lines in Fig. 26.2 are:


CC': y = 0.0938x + 30.56; RR': x = 0.0547y + 61.06.


26.2 Writing the bivariate frequency function in the form

$$f(x,y) = f_1(x)\,g(y\,|\,x),$$

so that the $j$th moment about the origin of the y-array for given x is

$$\mu_j'(y\,|\,x) = \int_{-\infty}^{\infty}y^j\,g(y\,|\,x)\,dy,$$

show that

$$\Big[\frac{\partial^j\phi(t,u)}{\partial u^j}\Big]_{u=0} = i^j\int_{-\infty}^{\infty}e^{itx}f_1(x)\,\mu_j'(y\,|\,x)\,dx$$

(where $\phi$ is the characteristic function of the distribution), so that

$$f_1(x)\,\mu_j'(y\,|\,x) = \frac{(-i)^j}{2\pi}\int_{-\infty}^{\infty}e^{-itx}\Big[\frac{\partial^j\phi}{\partial u^j}\Big]_{u=0}\,dt.$$

Use this to verify that the bivariate normal distribution has linear regressions and is
homoscedastic.


(Wicksell, 1934) 
26.3 A bivariate normal distribution is dichotomized at some value of y. The
variance of x for the whole distribution is known to be $\sigma^2$ and that for one part of the
dichotomy is $\sigma_1^2$. The correlation between x and y for the latter is c. Show that the
correlation of x and y in the whole distribution may be estimated by r, where

$$r^2 = 1-\frac{\sigma_1^2}{\sigma^2}(1-c^2).$$


26.4 In the previous exercise, if $\sigma_2^2$ is the variance of y in the whole distribution and
$\sigma_3^2$ is its variance in the part of the dichotomy, show that $\rho$ for the whole distribution
may be estimated by

$$r^2 = \frac{c^2\sigma_2^2}{\sigma_3^2+c^2(\sigma_2^2-\sigma_3^2)}.$$


26.5 Show that whereas tetrachoric $r_t$, biserial $r_\eta$, and point-biserial $r_{pb}$ can never
exceed unity in absolute value, biserial $r_b$ may do so.


26.6 Prove that the tetrachoric series (26.67) always converges for $|\rho| < 1$.


26.7 A set of variables $x_1, x_2, \ldots, x_n$ are distributed so that the product-moment
correlation of $x_i$ and $x_j$ is $\rho_{ij}$. They all have the same variance. Show that the average
value of $\rho_{ij}$ defined by

$$\bar\rho = \frac{1}{n(n-1)}\sum_{i\neq j}\rho_{ij}$$

must be not less than $-1/(n-1)$.


26.8 In the previous exercise show that $|\rho_{ij}|$, the determinant of the array of
correlation coefficients, is non-negative. Hence show that

$$\rho_{12}^2+\rho_{13}^2+\rho_{23}^2 \leq 1+2\rho_{12}\rho_{13}\rho_{23}.$$


26.9 Show from (16.86) that in samples from a bivariate normal population the
sampling distribution of $b_2$, the regression coefficient of y on x, has variance

$$\operatorname{var} b_2 = \frac{1}{n-3}\,\frac{\sigma_2^2}{\sigma_1^2}\,(1-\rho^2)$$

exactly, and that its skewness and kurtosis coefficients are

$$\gamma_1 = 0, \qquad \gamma_2 = \frac{6}{n-5}.$$


26.10 Let p{(x—)/o, y} denote the bivariate normal frequency with means of x 
and y equal to uw and 0 respectively, variances equal to o* and 1 respectively, and correla- 
tion p. Define 
@ 


ydy, (x, o= | yp dy. 


— 0 


E (x, ) = & 


@ 


If 2; is a random variable taking the values 0, 1 according as y < w or y = a, show that 
in a biserial table the Likelihood Function may be written 


n ee Bose 
L(x, y|, p, 4,6) = II {208 ee 0) +0209 (—% o)}. 
ree | Oo Oo 
If $\partial^2$ represents a partial differential of the second order with respect to any pair of
parameters, show that
E(@logL) = n{i—p(x) } £, (07 log 7) + np (x) E, (# log £) 


where 
Tee | (20)-texp (— 32) dt, 
zx 


and E,, E, are conditional expectations with respect to x for y < w, y > respectively. 
Hence derive the inverse of the dispersion matrix for the Maximum Likelihood estimators 
of the four parameters (the order of rows and columns being the same as the order of 
the parameters in the LF): 


7, PO ao~ 4s Pa pay 
. 1 — p? o o 
A,—2pwa,t+p*way p*way—pay p*wa,—pay, 
So (1— p*)? (ip) a (1—p) 
(i = 9*) 1—p?+p?a, pa, 
o* o* 
2(1—p?) +p? a, 
o2 
where a; = | x* g(x, w, p) dx, 
= W — px px—@ 
x, , p) = (20)-texp(— 4x? ja 
and $ (x) = (27)~* exp (— 3x") /{1—p (x)}. 


By inverting this matrix, derive the asymptotic variance of the Maximum Likelihood 
estimator fy, in the form 


(o.@) 
| gdx 
(1—p% = =A ete 


we = he = = 2 = 
| gdv| tede-(| vedy) 
—o — —.00 


26.11 In Exercise 26.10, show that when $\rho = 0$,

$$\operatorname{var}\hat\rho_b \sim \frac{2\pi\,p(\omega)\{1-p(\omega)\}}{n\exp(-k^2)}.$$

(Tate, 1955)


By comparing this with the large-sample formula (26.76), show that when $\rho = 0$, $r_b$ is a
fully efficient estimator.

(Tate, 1955)


26.12 In Exercise 26.10, show that var $\hat\rho_b$ tends to zero as $|\rho|$ tends to unity, and
from (26.76) that $n$ var $r_b$ does not, and hence that $r_b$ is of zero efficiency near $|\rho| = 1$.


(Tate, 1955; the results of Exercises 26.10-12 are extended to 
the multinormal distribution by J. F. Hannan and Tate (1965).) 


26.13 Establish equations (26.19) and (26.20). 
26.14 Writing $l$ for the sample intraclass coefficient and $\lambda$ for the parent value, show
that the exact distribution of $l$ is given by


(1 —D2P@—-D-144 + (R—1) 12-3) dl, 

where / is calculated from p families of k members each, reducing in the case k = 2 to 
I'(p—3) 

I (p—1) (2m): 

where / = tanh z, A = tanh €. Hence show that, for k = 2, z—é is nearly normal with 


dF ox 


adF = sech?—2 (z— &)exp {- 4 (z—€) } 


mean zero and variance 5 
(Fisher, 1921c) 


26.15 Show that for testing $\rho = \rho_0$ in a bivariate normal population, the Likelihood
Ratio statistic is given by

$$l^{1/n} = \frac{(1-r^2)^{1/2}(1-\rho_0^2)^{1/2}}{1-r\rho_0},$$

so that $l^{1/n} = (1-r^2)^{1/2}$ when $\rho_0 = 0$, and when $\rho_0 \neq 0$ we have

$$(1-\rho_0^2)^{-1/2}\,l^{1/n} = 1+r\rho_0+r^2(\rho_0^2-\tfrac12)+\ldots$$


26.16 Show that the effect of applying Sheppard's corrections to the moments is
always to increase the value of the correlation coefficient.


26.17 Show that if x and y are respectively subject to errors of observation u, v,
where u and v are uncorrelated with x, y and each other, the correlation coefficient is
reduced ("attenuated") by a factor

$$\Big\{\Big(1+\frac{\sigma_u^2}{\sigma_x^2}\Big)\Big(1+\frac{\sigma_v^2}{\sigma_y^2}\Big)\Big\}^{1/2}.$$


26.18 If $x_1$, $x_2$, $x_3$ are mutually uncorrelated with positive means and small coefficients
of variation $V_i$ ($i = 1, 2, 3$), show that the correlation between $x_1/x_3$ and $x_2/x_3$ is
approximately

$$\rho = \frac{V_3^2}{\{(V_1^2+V_3^2)(V_2^2+V_3^2)\}^{1/2}}.$$

(This is sometimes called a "spurious" correlation, the reason being that the original
variables were uncorrelated, but it is not a well-chosen term.)


26.19 If two bivariate normal populations have $\rho_1 = \rho_2 = \rho$, the other parameters
being unspecified, show that the Maximum Likelihood estimator of $\rho$ is

$$\hat\rho = \frac{n(1+r_1r_2)-\{n^2(1-r_1r_2)^2-4n_1n_2(r_1-r_2)^2\}^{1/2}}{2(n_1r_2+n_2r_1)},$$

where $n_i$, $r_i$ are the sample sizes and correlation coefficients ($i = 1, 2$) and $n = n_1+n_2$.
If $n_1 = n_2 = \frac12 n$, show that if $z_1$, $z_2$ are defined by (26.25), and

$$\hat\zeta = \tfrac12\log\Big(\frac{1+\hat\rho}{1-\hat\rho}\Big),$$

then

$$\hat\zeta = \tfrac12(z_1+z_2)$$

exactly.
26.20 Using the result of the last exercise, show that the Likelihood Ratio test of
$\rho_1 = \rho_2$ when $n_1 = n_2$ uses the statistic

$$l^{1/n} = \operatorname{sech}\{\tfrac12(z_1-z_2)\},$$

so that it is a one-to-one function of $z_1-z_2$, the statistic suggested in 26.19.
(Brandner, 1933)


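26.21 In Exercise 26.19, show that if $n_1 \neq n_2$, we have approximately for the ML
estimator of $\zeta$

$$\hat\zeta = \frac{1}{n}(n_1z_1+n_2z_2),$$

and hence that the LR test of $\rho_1 = \rho_2$ uses the statistic

$$l = \Big[\operatorname{sech}\Big\{\frac{n_2}{n}(z_1-z_2)\Big\}\Big]^{n_1}\Big[\operatorname{sech}\Big\{\frac{n_1}{n}(z_1-z_2)\Big\}\Big]^{n_2}$$

approximately, again a one-to-one function of $(z_1-z_2)$.
(Brandner, 1933)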


26.22 To estimate a common value of $\rho$ for two bivariate normal populations, show
that

$$\zeta^* = \frac{(n_1-3)z_1+(n_2-3)z_2}{n_1+n_2-6}$$

is the linear combination of $z_1$ and $z_2$ with minimum variance as an estimator of $\zeta$, but
that when $n_1 \neq n_2$ this does not give the Maximum Likelihood estimator of $\rho$ given in
Exercise 26.19.


26.23 Show that the correlation coefficient between x and y, $\rho_{xy}$, satisfies

$$\rho_{xy}^2 = \frac{\operatorname{var}(\alpha_2+\beta_2x)}{\sigma_2^2} = 1-\frac{E\{[y-(\alpha_2+\beta_2x)]^2\}}{\sigma_2^2}$$

and hence establish (26.18).


26.24 Writing $z = E(x\,|\,y)$, show that

$$\rho_{xz} = \eta_1$$

and that

$$\rho_{yz} = \rho_{xy}/\eta_1.$$

Hence show that (26.18) implies (26.46) and establish the conditions under which the
various equalities in (26.46) hold.
(M. Fréchet published these relations in 1933-1935; see Kruskal (1958))


CHAPTER 27


PARTIAL AND MULTIPLE CORRELATION 


27.1 In normal or nearly-normal variation, the correlation parameter $\rho$ between
two variables can, as we saw in 26.10, be used as a measure of interdependence. When
we come to interpret "interdependence" in practice, however, we often meet
difficulties of the kind discussed in 26.4: if a variable is correlated with a second variable,
this may be merely incidental to the fact that both are correlated with another variable
or set of variables. This consideration leads us to examine the correlations between
variables when other variables are held constant, i.e. conditionally upon those other
variables taking certain fixed values. These are the so-called partial correlations.

If we find that holding another variable fixed reduces the correlation between two 
variables, we infer that their interdependence arises in part through the agency of that 
other variable ; and, if the partial correlation is zero or very small, we infer that their 
interdependence is entirely attributable to that agency. Conversely, if the partial 
correlation is larger than the original correlation between the variables we infer that 
the other variable was obscuring the stronger connection or, as we may say, “‘ masking ”’ 
the correlation. But it must be remembered that even in the latter case we still have 
no warrant to presume a causal connection: by the argument of 26.4, some quite 
different variable, overlooked in our analysis, may be at work to produce the correlation. 
As with ordinary product-moment correlations, so with partial correlations: the pre- 
sumption of causality must always be extra-statistical. 


27.2 In this branch of the subject, it is difficult at times to arrive at a 
notation which is unambiguous and flexible without being impossibly cumbrous. 
Basing ourselves on Yule’s (1907) system of notation, we shall do our best to steer a 
middle course, but we shall at times have to make considerable demands on the reader’s
tolerance of suffixes.

As in Chapter 26, we shall discuss linear regression incidentally, but we leave over
the main discussion of regression problems to Chapter 28.


Partial correlation 

27.3 Consider three multinormally distributed variables. We exclude the singular
case (cf. 15.2), and lose no generality, so far as correlations are concerned, if we
standardize the variables. Their dispersion matrix then becomes the matrix of their
correlations, which we shall call the correlation matrix and denote by $C$. Thus if the
correlation between $x_i$ and $x_j$ is $\rho_{ij}$, the frequency function becomes, from (15.19),

$$f(x_1,x_2,x_3) = (2\pi)^{-3/2}\,|C|^{-1/2}\exp\Big\{-\frac{1}{2|C|}\sum_{i,j=1}^{3}C_{ij}x_ix_j\Big\}, \tag{27.1}$$

where $C_{ij}$ is the cofactor of the $(i,j)$th element in the symmetric correlation determinant

$$|C| = \begin{vmatrix}1 & \rho_{12} & \rho_{13}\\ & 1 & \rho_{23}\\ & & 1\end{vmatrix}. \tag{27.2}$$

$C_{ij}/|C| = C^{ij}$ is the element of the reciprocal of $C$. We shall sometimes write the
determinant or matrix of correlations in this way, leaving the entries below the leading
diagonal to be filled in by symmetry.

The c.f. of the distribution is, by (15.20),

$$\phi(t_1,t_2,t_3) = \exp\Big\{-\tfrac12\sum_{i,j=1}^{3}\rho_{ij}t_it_j\Big\}. \tag{27.3}$$


27.4 Consider the correlation between $x_1$ and $x_2$ for a fixed value of $x_3$. The
conditional distribution of $x_1$ and $x_2$, given $x_3$, is

$$g(x_1,x_2\,|\,x_3) \propto \exp\{-\tfrac12(C^{11}x_1^2+2C^{12}x_1x_2+C^{22}x_2^2+2C^{13}x_1x_3+2C^{23}x_2x_3)\}$$

$$\propto \exp\{-\tfrac12[C^{11}(x_1-\xi_1)^2+2C^{12}(x_1-\xi_1)(x_2-\xi_2)+C^{22}(x_2-\xi_2)^2]\}, \tag{27.4}$$

where

$$C^{11}\xi_1+C^{12}\xi_2 = -C^{13}x_3, \qquad C^{12}\xi_1+C^{22}\xi_2 = -C^{23}x_3.$$

From (27.4) we see that, given $x_3$, $x_1$ and $x_2$ are bivariate-normally distributed, with
correlation coefficient which we write

$$\rho_{12.3} = \frac{-C^{12}}{(C^{11}C^{22})^{1/2}}.$$

Clearly, $\rho_{12.3}$ does not depend on the actual value at which $x_3$ is fixed. Furthermore,
cancelling the factor in $|C|$, we have

$$\rho_{12.3} = \frac{-C_{12}}{(C_{11}C_{22})^{1/2}} = \frac{\rho_{12}-\rho_{13}\rho_{23}}{\{(1-\rho_{13}^2)(1-\rho_{23}^2)\}^{1/2}} \tag{27.5}$$

from (27.2). $\rho_{12.3}$ is called the partial correlation coefficient of $x_1$ and $x_2$ with $x_3$
fixed. It is symmetric in its primary subscripts 1, 2. Its secondary subscript, 3, refers
to the variable held fixed.

Although (27.5) has been derived under the assumption of normality, we now
define the partial correlation coefficient by (27.5) for any parent distribution.


27.5 Similarly, if we have a $p$-variate non-singular multinormal distribution and
fix $(p-2)$ of the variates, the resulting partial correlation of the other two (say $x_1$, $x_2$) is

$$\rho_{12.34\ldots p} = \frac{-C_{12}}{(C_{11}C_{22})^{1/2}}, \tag{27.6}$$

where $C_{ij}$ is the cofactor of $\rho_{ij}$ in

$$|C| = \begin{vmatrix}1 & \rho_{12} & \rho_{13} & \cdots & \rho_{1p}\\ & 1 & \rho_{23} & \cdots & \rho_{2p}\\ & & 1 & \cdots & \rho_{3p}\\ & & & \ddots & \vdots\\ & & & & 1\end{vmatrix}. \tag{27.7}$$

Like (27.5), (27.6) is to be regarded as a general definition of the partial correlation
coefficient between $x_1$ and $x_2$ with $x_3, \ldots, x_p$ fixed.
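
The cofactor definition can be tried out numerically. The sketch below is our own illustration: it uses the fact, stated in 27.3, that the elements of the reciprocal matrix are the cofactors divided by $|C|$, so that the partial correlation of the first two variates given all the others is $-C^{12}/(C^{11}C^{22})^{1/2}$. The matrix inverse is a plain Gauss-Jordan routine written for the purpose, and the correlation values are hypothetical.

```python
import math

def inverse(m):
    """Gauss-Jordan inverse of a small non-singular square matrix (list of lists)."""
    n = len(m)
    a = [list(map(float, row)) + [float(i == j) for j in range(n)]
         for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        pv = a[col][col]
        a[col] = [v / pv for v in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [row[n:] for row in a]

def partial_corr_12(corr):
    """Partial correlation of variates 1 and 2 with all remaining variates fixed."""
    inv = inverse(corr)
    return -inv[0][1] / math.sqrt(inv[0][0] * inv[1][1])

def partial_corr_three(r12, r13, r23):
    """The explicit three-variate formula."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# Hypothetical correlations among three variables.
C = [[1.0, 0.6, 0.5],
     [0.6, 1.0, 0.4],
     [0.5, 0.4, 1.0]]
print(round(partial_corr_12(C), 4))              # 0.504
print(round(partial_corr_three(0.6, 0.5, 0.4), 4))  # 0.504
```

The inverse-matrix route works unchanged for any number of fixed variates, whereas the explicit formula covers only the three-variate case.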


27.6 It is instructive to consider the same problem from another angle. 
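Write $f(x_1, \ldots, x_k \mid x_{k+1}, \ldots, x_p)$ for the conditional joint frequency function of $x_1, \ldots, x_k$ when $x_{k+1}, \ldots, x_p$ are fixed, and $g(x_{k+1}, \ldots, x_p)$ for the marginal joint distribution of $x_{k+1}, \ldots, x_p$.

The joint c.f. of the p variables is

$$\begin{aligned}
\phi(t_1, \ldots, t_p) &= \int\!\cdots\!\int f(x_1, \ldots, x_k \mid x_{k+1}, \ldots, x_p)\, g(x_{k+1}, \ldots, x_p) \exp\Bigl(\sum_{j=1}^{p} i t_j x_j\Bigr) dx_1 \cdots dx_p \\
&= \int\!\cdots\!\int \phi_k(t_1, \ldots, t_k \mid x_{k+1}, \ldots, x_p)\, g(x_{k+1}, \ldots, x_p) \exp\Bigl(\sum_{j=k+1}^{p} i t_j x_j\Bigr) dx_{k+1} \cdots dx_p,
\end{aligned}$$

where $\phi_k(t_1, \ldots, t_k \mid x_{k+1}, \ldots, x_p)$ is the conditional joint c.f. of $x_1, \ldots, x_k$. It follows from the multivariate Inversion Theorem (4.17) that

$$\phi_k\, g = \frac{1}{(2\pi)^{p-k}} \int\!\cdots\!\int \phi(t_1, \ldots, t_p) \exp\Bigl(-\sum_{j=k+1}^{p} i t_j x_j\Bigr) dt_{k+1} \cdots dt_p. \tag{27.8}$$

If we put $t_1 = t_2 = \ldots = t_k = 0$ in (27.8), we obtain, since $\phi_k$ then becomes equal to unity,

$$g = \frac{1}{(2\pi)^{p-k}} \int\!\cdots\!\int \phi(0, \ldots, 0, t_{k+1}, \ldots, t_p) \exp\Bigl(-\sum_{j=k+1}^{p} i t_j x_j\Bigr) dt_{k+1} \cdots dt_p. \tag{27.9}$$

Hence, dividing (27.8) by (27.9),

$$\phi_k = \frac{\displaystyle\int\!\cdots\!\int \phi(t_1, \ldots, t_p) \exp\Bigl(-\sum_{j=k+1}^{p} i t_j x_j\Bigr) dt_{k+1} \cdots dt_p}{\displaystyle\int\!\cdots\!\int \phi(0, \ldots, 0, t_{k+1}, \ldots, t_p) \exp\Bigl(-\sum_{j=k+1}^{p} i t_j x_j\Bigr) dt_{k+1} \cdots dt_p}. \tag{27.10}$$

This is a general result, suggested by a theorem of Bartlett (1938).
If we now assume that the p variables are multinormal, the integrand of the numerator in (27.10) becomes, using the c.f. (15.20),
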
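$$\begin{aligned}
&\exp\Bigl(-\tfrac{1}{2}\sum_{l,j=1}^{p}\rho_{lj}t_l t_j - \sum_{j=k+1}^{p} i t_j x_j\Bigr) \\
&\quad= \exp\Bigl(-\tfrac{1}{2}\sum_{l,j=1}^{k}\rho_{lj}t_l t_j\Bigr) \exp\Bigl(-\tfrac{1}{2}\sum_{l,j=k+1}^{p}\rho_{lj}t_l t_j\Bigr) \exp\Bigl(-\sum_{l=1}^{k}\sum_{j=k+1}^{p}\rho_{lj}t_l t_j\Bigr) \exp\Bigl(-\sum_{j=k+1}^{p} i t_j x_j\Bigr) \\
&\quad= \exp\Bigl(-\tfrac{1}{2}\sum_{l,j=1}^{k}\rho_{lj}t_l t_j\Bigr) \exp\Bigl(-\tfrac{1}{2}\sum_{l,j=k+1}^{p}\rho_{lj}t_l t_j\Bigr) \exp\Bigl\{-\sum_{j=k+1}^{p} i t_j\Bigl(x_j - i\sum_{l=1}^{k}\rho_{lj}t_l\Bigr)\Bigr\}.
\end{aligned} \tag{27.11}$$

Now the integral with respect to $t_{k+1}, \ldots, t_p$ of the last two factors on the right of (27.11) is the inversion of the multinormal c.f. of $x_{k+1}, \ldots, x_p$ with $x_j$ measured from the value $i\sum_{m=1}^{k}\rho_{jm}t_m$. This change of origins does not affect correlations. If we write D for the correlation matrix of $x_{k+1}, \ldots, x_p$ alone, and $D^{lj}$ for the elements of its inverse, this gives for the integral of (27.11) a constant times

$$\exp\Bigl(-\tfrac{1}{2}\sum_{l,j=1}^{k}\rho_{lj}t_l t_j\Bigr) \exp\Bigl\{-\tfrac{1}{2}\sum_{l,j=k+1}^{p} D^{lj}\Bigl(x_l - i\sum_{m=1}^{k}\rho_{lm}t_m\Bigr)\Bigl(x_j - i\sum_{m=1}^{k}\rho_{jm}t_m\Bigr)\Bigr\}.$$

From (27.10) we then find

$$\begin{aligned}
\phi_k(t_1, \ldots, t_k \mid x_{k+1}, \ldots, x_p) = {}&\exp\Bigl(-\tfrac{1}{2}\sum_{l,j=1}^{k}\rho_{lj}t_l t_j\Bigr) \times \\
&\exp\Bigl\{-\tfrac{1}{2}\sum_{l,j=k+1}^{p} D^{lj}\Bigl(x_l - i\sum_{m=1}^{k}\rho_{lm}t_m\Bigr)\Bigl(x_j - i\sum_{m=1}^{k}\rho_{jm}t_m\Bigr) + \tfrac{1}{2}\sum_{l,j=k+1}^{p} D^{lj}x_l x_j\Bigr\}.
\end{aligned} \tag{27.12}$$
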
Thus if $\sigma'_{uv}$ denotes the covariance of $x_u$ and $x_v$ in the conditional distribution of $x_1, \ldots, x_k$, and $\sigma_{uv}$ their covariance unconditionally, we find, on identifying coefficients of $t_u t_v$ in (27.12),

$$\sigma'_{uv} = \sigma_{uv} - \sum_{l,j=k+1}^{p} D^{lj}\rho_{lu}\rho_{jv}. \tag{27.13}$$

This is in terms of standardized initial variables. If we now destandardize, the variance of $x_i$ being $\sigma_i^2$, each $\rho$ is replaced by its corresponding $\sigma$, $D^{lj}$ is replaced by the dispersion matrix elements $D^{lj}/(\sigma_l\sigma_j)$ and we have the more general form of (27.13)

$$\sigma'_{uv} = \sigma_{uv} - \sum_{l,j=k+1}^{p} \sigma_{lu}\sigma_{jv} D^{lj}/(\sigma_l\sigma_j). \tag{27.14}$$

(27.14) does not depend on the values at which $x_{k+1}, \ldots, x_p$ are fixed.

If we write A for the (k × k) unconditional dispersion matrix $\{\sigma_{uv}\}$, B′ for the (k × (p − k)) matrix $\{\sigma_{ul}\}$ and E for the ((p − k) × (p − k)) dispersion matrix of which D is the standardized form, (27.14) states that the conditional dispersion matrix

$$\{\sigma'_{uv}\} = A - B'E^{-1}B.$$
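
As a numerical illustration of (27.14) in its matrix form, the sketch below (an addition, not from the original text) fixes the last two of four variates in one step through $A - B'E^{-1}B$ and compares the result with fixing one variate at a time by the single-variable rule derived in 27.7. The function names and the dispersion matrix `V` are assumptions chosen for the example, and the explicit 2 × 2 inverse restricts `schur_two_fixed` to exactly two fixed variates.

```python
def condition_on_last(S):
    """Dispersion matrix of the remaining variates when the last one is fixed."""
    p = len(S) - 1
    return [[S[u][v] - S[u][p] * S[v][p] / S[p][p] for v in range(p)]
            for u in range(p)]

def schur_two_fixed(V, k):
    """A - B'E^{-1}B when exactly two variates (the last two) are fixed."""
    A = [row[:k] for row in V[:k]]
    B = [row[:k] for row in V[k:]]   # (p-k) x k block of covariances
    E = [row[k:] for row in V[k:]]   # dispersion matrix of the fixed variates
    d = E[0][0] * E[1][1] - E[0][1] * E[1][0]
    E_inv = [[E[1][1] / d, -E[0][1] / d],
             [-E[1][0] / d, E[0][0] / d]]
    return [[A[u][v] - sum(B[l][u] * E_inv[l][j] * B[j][v]
                           for l in range(2) for j in range(2))
             for v in range(k)] for u in range(k)]

# Illustrative dispersion matrix of (x1, x2, x3, x4); the values are assumed.
V = [[4.0, 2.0, 1.0, 1.0],
     [2.0, 9.0, 3.0, 1.0],
     [1.0, 3.0, 16.0, 4.0],
     [1.0, 1.0, 4.0, 25.0]]

direct = schur_two_fixed(V, 2)                       # fix x3 and x4 together
stepwise = condition_on_last(condition_on_last(V))   # fix x4, then x3
print(direct)
print(stepwise)
```

Neither computation uses the values at which the variates are fixed, in line with the remark following (27.14).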


27.7 In particular, if we fix only one variable, say $x_p$, we have $D^{pp} = 1$ and the conditional covariance (27.14) becomes simply

$$\sigma'_{uv} = \sigma_{uv} - \sigma_{up}\sigma_{vp}/\sigma_p^2 = \sigma_u\sigma_v(\rho_{uv} - \rho_{up}\rho_{vp}), \tag{27.15}$$

and if u = v we have from (27.15) the conditional variance of $x_u$

$$\sigma'^2_u = \sigma_u^2(1 - \rho_{up}^2),$$

and the last two formulae give the conditional correlation coefficient

$$\rho_{uv.p} = \frac{\rho_{uv} - \rho_{up}\rho_{vp}}{\{(1-\rho_{up}^2)(1-\rho_{vp}^2)\}^{1/2}},$$

another form of (27.5).
If we fix all but two variables, say $x_1$ and $x_2$, we have from (27.14)

$$\rho_{12.34\ldots p} = \frac{\rho_{12} - \sum_{l,j=3}^{p} D^{lj}\rho_{l1}\rho_{j2}}{\Bigl\{\Bigl(1 - \sum_{l,j=3}^{p} D^{lj}\rho_{l1}\rho_{j1}\Bigr)\Bigl(1 - \sum_{l,j=3}^{p} D^{lj}\rho_{l2}\rho_{j2}\Bigr)\Bigr\}^{1/2}}. \tag{27.16}$$

Inspection of (27.7) shows that the minor of $\rho_{12}$, namely,

$$\begin{vmatrix}
\rho_{21} & \rho_{23} & \rho_{24} & \cdots & \rho_{2p} \\
\rho_{31} & 1 & \rho_{34} & \cdots & \rho_{3p} \\
\vdots & \vdots & \vdots & & \vdots \\
\rho_{p1} & \rho_{p3} & \rho_{p4} & \cdots & 1
\end{vmatrix},$$

may be expanded by its first row and column as

$$\rho_{21}|D| - |D|\sum_{l,j=3}^{p} D^{lj}\rho_{l1}\rho_{j2},$$

and similarly for the minors of $\rho_{11}$, $\rho_{22}$. Thus (27.16) may be written

$$\rho_{12.34\ldots p} = -\frac{C_{12}}{(C_{11}C_{22})^{1/2}},$$

which is (27.6) again.


Linear partial regressions 

27.8 We now consider the extension of the linear regression relations of 26.7 to p variates. For p multinormal variates $x_i$ with zero means and variances $\sigma_i^2$, the mean of $x_1$ if $x_2, \ldots, x_p$ are fixed is seen from the exponent of the distribution to be

$$E(x_1 \mid x_2, \ldots, x_p) = -\sigma_1\sum_{j=2}^{p}\frac{C_{1j}}{C_{11}\sigma_j}\,x_j. \tag{27.17}$$

We shall denote the regression coefficient of $x_1$ on $x_j$ with the other (p − 2) variables held fixed by $\beta_{1j.23\ldots(j-1)(j+1)\ldots p}$ or, for brevity, by $\beta_{1j.q_j}$, where q stands for “the other variables than those in the primary subscripts,” and the suffix to q is to distinguish different q's. The $\beta_{1j.q_j}$ are the partial regression coefficients.

We have, therefore,

$$E(x_1 \mid x_2, \ldots, x_p) = \beta_{12.q_2}x_2 + \beta_{13.q_3}x_3 + \ldots + \beta_{1p.q_p}x_p. \tag{27.18}$$

Comparison of (27.18) with (27.17) gives, in the multinormal case,

$$\beta_{1j.q_j} = -\frac{\sigma_1}{\sigma_j}\frac{C_{1j}}{C_{11}}. \tag{27.19}$$

Similarly, the regression coefficient of $x_j$ upon $x_1$ with the other variables fixed is

$$\beta_{j1.q_j} = -\frac{\sigma_j}{\sigma_1}\frac{C_{j1}}{C_{jj}}, \tag{27.20}$$

and thus, since $C_{1j} = C_{j1}$, (27.6), (27.19) and (27.20) give

$$\rho^2_{1j.q_j} = \frac{C_{1j}^2}{C_{11}C_{jj}} = \beta_{1j.q_j}\beta_{j1.q_j}, \tag{27.21}$$

an obvious generalization of (26.17). (27.19) and (27.20) make it obvious that $\beta_{1j.q_j}$ is not symmetric in $x_1$ and $x_j$, which is what we should expect from a coefficient of dependence. Like (27.5) and (27.6), (27.19) and (27.20) are definitions of the partial coefficients in the general case.


Errors from linear regression 
27.9 We define the error(*) of order (p − 1)

$$x_{1.2\ldots p} = x_1 - E(x_1 \mid x_2, \ldots, x_p).$$

It has zero mean and its variance is

$$\sigma^2_{1.2\ldots p} = E(x^2_{1.2\ldots p}) = E[\{x_1 - E(x_1 \mid x_2, \ldots, x_p)\}^2],$$

so that $\sigma^2_{1.2\ldots p}$ is the error variance of $x_1$ about the regression. We have at once, from (27.18),

$$\sigma^2_{1.2\ldots p} = E\Bigl\{\Bigl(x_1 - \sum_{j=2}^{p}\beta_{1j.q_j}x_j\Bigr)^2\Bigr\} \tag{27.22}$$

$$= E\Bigl\{x_1\Bigl(x_1 - \sum_{j=2}^{p}\beta_{1j.q_j}x_j\Bigr) - \sum_{j=2}^{p}\beta_{1j.q_j}x_j\Bigl(x_1 - \sum_{j=2}^{p}\beta_{1j.q_j}x_j\Bigr)\Bigr\}. \tag{27.23}$$

If we take expectations in two stages, first keeping $x_2, \ldots, x_p$ fixed, we find that the conditional expectation of the second product in the right of (27.23) is zero by (27.18), so that

$$\sigma^2_{1.2\ldots p} = E\Bigl\{x_1\Bigl(x_1 - \sum_{j=2}^{p}\beta_{1j.q_j}x_j\Bigr)\Bigr\} = \sigma_1^2 - \sum_{j=2}^{p}\beta_{1j.q_j}\sigma_{1j}. \tag{27.24}$$

The error variance (27.24) is independent of the values fixed for $x_2, \ldots, x_p$ if the $\beta_{1j.q_j}$ are independent of these values. The distribution of $x_1$ in arrays is then said to be homoscedastic (or heteroscedastic in the contrary case). This constancy of error variance makes the interpretation of regressions and correlations easier. For example, in the normal case, the conditional variances and covariances obtained by fixing a set of variates do not depend on the values at which they are fixed (cf. (27.14)). In other cases, we must make due allowance for observed heteroscedasticity in our interpretations: the partial regression coefficients are then, perhaps, best regarded as average relationships over all possible values of the fixed variates.

(*) This is often called a “residual” in the literature, but we shall distinguish between errors from population linear regressions and residuals from regressions fitted to sample data.


Relations between variances, regressions and correlations of different orders
27.10 Given p variables, we may examine the correlation between any pair when any subset of the others is fixed, and similarly we may be interested in the regression of any one upon any subset of the others. The number of possible coefficients becomes very large as p increases. When a coefficient contains k secondary subscripts, it is said to be of order k. Thus $\rho_{12.34}$ is of order 2, $\rho_{12.3}$ of order 1 and $\rho_{12}$ of order zero, while $\beta_{12.678}$ is of order 3 and $\sigma^2_{1.5678}$ is of order 4. In our present notation, the linear regression coefficients of the last chapter, $\beta_1$ and $\beta_2$, would be written $\beta_{12}$ and $\beta_{21}$ respectively and are of order zero, as is an ordinary variance $\sigma^2$.

We have already seen in 27.4 and 27.7 how any correlation coefficient of order 1 
can be expressed in terms of those of order zero. We will now obtain more general 
results of this kind for all types of coefficient. 


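27.11 From (27.24) and (27.19) we have

$$\sigma^2_{1.2\ldots p} = \sigma_1^2 + \sum_{j=2}^{p}\frac{\sigma_1}{\sigma_j}\frac{C_{1j}}{C_{11}}\,\sigma_{1j}, \tag{27.25}$$

whence

$$\frac{\sigma^2_{1.2\ldots p}}{\sigma_1^2} = 1 + \frac{1}{C_{11}}\sum_{j=2}^{p} C_{1j}\rho_{1j} = 1 + \frac{1}{C_{11}}(|C| - C_{11}) = \frac{|C|}{C_{11}}$$

or, using the definition of q given in 27.8,

$$\sigma^2_{1.q} = \sigma_1^2\,\frac{|C|}{C_{11}}, \tag{27.26}$$

and similarly if 1 is replaced by any other suffix. More generally, it may be seen in the same way that
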
$$\operatorname{cov}(x_{l.q_l}, x_{m.q_m}) = \sigma_l\sigma_m\,\frac{|C|\,C_{lm}}{C_{ll}C_{mm}}, \tag{27.27}$$

which reduces to (27.26) when l = m. (27.27) applies to the case where the secondary subscripts of each variable include the primary subscript of the other. If, on the other hand, both sets of secondary subscripts exclude l and m, we denote a common set of secondary subscripts by r. The covariance of two errors $x_{l.r}$, $x_{m.r}$ is related to their correlation and variances by the natural extension of the definitions (26.10), (26.11) and (26.17), namely

$$\left.\begin{aligned}
\operatorname{cov}(x_{l.r}, x_{m.r})/\sigma^2_{m.r} &= \beta_{lm.r}, \\
\operatorname{cov}(x_{l.r}, x_{m.r})/\sigma^2_{l.r} &= \beta_{ml.r}, \\
\operatorname{cov}(x_{l.r}, x_{m.r})/(\sigma_{l.r}\sigma_{m.r}) &= \rho_{lm.r},
\end{aligned}\right\} \tag{27.28}$$

so that $\rho^2_{lm.r} = \beta_{lm.r}\beta_{ml.r}$, agreeing with the relationship (27.21) already found. By adjoining a set of suffixes, r, to both variables $x_l$, $x_m$ we simply do the same to all their coefficients.


27.12 We may now use (27.26) to obtain the relation between error variances of different orders. Writing |D| for the correlation determinant of all the variables except $x_2$, we have, from (27.26),

$$\sigma^2_{1.q-2} = \sigma_1^2\,\frac{|D|}{D_{11}}$$

(where the suffix q − 2 denotes the set q excluding $x_2$) and

$$\sigma^2_{1.q} = \sigma_1^2\,\frac{|C|}{C_{11}},$$

whence

$$\frac{\sigma^2_{1.q}}{\sigma^2_{1.q-2}} = \frac{D_{11}}{C_{11}}\,\frac{|C|}{|D|}. \tag{27.29}$$

Now $|D| = C_{22}$ by definition, and by Jacobi's generalized theorem on determinants

$$\begin{vmatrix} C_{11} & C_{12} \\ C_{12} & C_{22} \end{vmatrix} = |C|\,D_{11}, \tag{27.30}$$

since $D_{11}$ is the complementary minor of

$$\begin{vmatrix} \rho_{11} & \rho_{12} \\ \rho_{21} & \rho_{22} \end{vmatrix}$$

in C. Thus, using (27.30), (27.29) becomes

$$\frac{\sigma^2_{1.q}}{\sigma^2_{1.q-2}} = \frac{1}{C_{11}C_{22}}\begin{vmatrix} C_{11} & C_{12} \\ C_{12} & C_{22} \end{vmatrix} = 1 - \frac{C_{12}^2}{C_{11}C_{22}} \tag{27.31}$$

or, using (27.6),

$$\sigma^2_{1.q} = \sigma^2_{1.q-2}(1 - \rho^2_{12.q}). \tag{27.32}$$

(27.32) is a generalization of the bivariate result given in Exercise 26.23, which in our present notation would be written

$$\sigma^2_{2.1} = \sigma_2^2(1 - \rho_{12}^2).$$

We have also met this result in the special context of the bivariate normal distribution at (16.46).


27.13 (27.32) enables us to express the error variance of order (p − 1) in terms of the error variance and a correlation coefficient of order (p − 2). If we now again use (27.32) to express $\sigma^2_{1.q-2}$, we find in exactly the same way

$$\sigma^2_{1.q-2} = \sigma^2_{1.q-2-3}(1 - \rho^2_{13.q-2}).$$

We may thus apply (27.32) successively to obtain, writing subscripts more fully,

$$\sigma^2_{1.2\ldots p} = \sigma_1^2(1-\rho^2_{1p})(1-\rho^2_{1(p-1).p})(1-\rho^2_{1(p-2).(p-1)p})\cdots(1-\rho^2_{12.3\ldots p}). \tag{27.33}$$

In (27.33), the order in which the secondary subscripts of $\sigma^2_{1.23\ldots p}$ are taken is evidently immaterial; we may permute them as desired. In particular, we may write for simplicity

$$\sigma^2_{1.2\ldots p} = \sigma_1^2(1-\rho^2_{12})(1-\rho^2_{13.2})(1-\rho^2_{14.23})\cdots(1-\rho^2_{1p.23\ldots(p-1)}) = \sigma_1^2\,\frac{|C|}{C_{11}}, \tag{27.34}$$

by (27.26), the subscripts other than 1 in (27.34) being permutable. (27.34) enables us to express any error variance of order s in terms of the error variance of zero order and s correlation coefficients, one of each order from zero up to (s − 1).


27.14 We now turn to the regression coefficients. (27.15) may be written, for the covariance of $x_1$ and $x_2$ with $x_p$ fixed,

$$\sigma_{12.p} = \sigma_{12} - \sigma_{1p}\sigma_{p2}/\sigma_p^2,$$

and if we adjoin the suffixes 3, ..., (p − 1) throughout, we have

$$\sigma_{12.3\ldots p} = \sigma_{12.3\ldots(p-1)} - \frac{\sigma_{1p.3\ldots(p-1)}\,\sigma_{p2.3\ldots(p-1)}}{\sigma^2_{p.3\ldots(p-1)}}. \tag{27.35}$$

Using the definition (27.28) of a regression coefficient as the ratio of a covariance to a variance, i.e.

$$\beta_{ij.k} = \sigma_{ij.k}/\sigma^2_{j.k},$$

we have from (27.35), writing r for the set 3, ..., (p − 1),

$$\beta_{12.pr}\,\sigma^2_{2.pr} = \beta_{12.r}\,\sigma^2_{2.r} - \beta_{1p.r}\,\beta_{p2.r}\,\sigma^2_{2.r},$$

or

$$\beta_{12.pr} = \frac{\sigma^2_{2.r}}{\sigma^2_{2.pr}}(\beta_{12.r} - \beta_{1p.r}\beta_{p2.r}). \tag{27.36}$$

If we put $x_1 = x_2$ in (27.36), we obtain

$$\sigma^2_{2.pr} = \sigma^2_{2.r}(1 - \beta_{2p.r}\beta_{p2.r}) = \sigma^2_{2.r}(1 - \rho^2_{2p.r}), \tag{27.37}$$

another form of (27.32). Thus, from (27.36) and (27.37),

$$\beta_{12.pr} = \frac{\beta_{12.r} - \beta_{1p.r}\beta_{p2.r}}{1 - \beta_{2p.r}\beta_{p2.r}}, \tag{27.38}$$

the required formula for expressing a regression coefficient in terms of some of those of next lower order. Repeated applications of (27.38) give any regression coefficient in terms of those of zero order.

Finally, using (27.21), we find from (27.38)

$$\rho_{12.pr} = (\beta_{12.pr}\beta_{21.pr})^{1/2} = \frac{\rho_{12.r} - \rho_{1p.r}\rho_{2p.r}}{\{(1-\rho^2_{1p.r})(1-\rho^2_{2p.r})\}^{1/2}}, \tag{27.39}$$

which is (27.5) generalized by adjoining the set of suffixes r.
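
The recursion (27.39) and the product form (27.34) lend themselves to a short numerical check. The sketch below is an illustration added here, not the authors' computing scheme: `partial_corr` applies (27.39) recursively, removing one fixed variable at a time, `det` is a naive Laplace expansion suitable only for small p, and the 4 × 4 correlation matrix `R` is an assumed example (indices are 0-based, so variate 1 is index 0).

```python
import math

def det(m):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def partial_corr(R, i, j, given):
    """rho_{ij.given}, by repeated use of (27.39) on the last fixed variable."""
    if not given:
        return R[i][j]
    p, rest = given[-1], given[:-1]
    r_ij = partial_corr(R, i, j, rest)
    r_ip = partial_corr(R, i, p, rest)
    r_jp = partial_corr(R, j, p, rest)
    return (r_ij - r_ip * r_jp) / math.sqrt((1 - r_ip ** 2) * (1 - r_jp ** 2))

# Illustrative correlation matrix of four variates (assumed values).
R = [[1.0, 0.5, 0.3, 0.2],
     [0.5, 1.0, 0.4, 0.1],
     [0.3, 0.4, 1.0, 0.6],
     [0.2, 0.1, 0.6, 1.0]]

# (27.34) with sigma_1 = 1: product of (1 - rho^2) terms of orders 0, 1, 2.
product = ((1 - partial_corr(R, 0, 1, []) ** 2)
           * (1 - partial_corr(R, 0, 2, [1]) ** 2)
           * (1 - partial_corr(R, 0, 3, [1, 2]) ** 2))

# The same quantity as |C| / C_11.
C11 = det([row[1:] for row in R[1:]])
ratio = det(R) / C11
print(product, ratio)
```

The order in which the fixed variables are removed does not affect the result, so `partial_corr(R, 0, 1, [2, 3])` and `partial_corr(R, 0, 1, [3, 2])` agree up to rounding error.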


Approximate linear partial regressions 

27.15 In our discussion from 27.8 onwards we have taken the regression relation- 
ships to be exactly linear, of type (27.18). Just as in 26.8, we now consider the question 
of fitting regression relationships of this type to observed populations, whose regressions 
are almost never exactly linear, and by the same reasoning as there, we are led to the 
Method of Least Squares. We therefore choose the $\beta_{1j.q_j}$ to minimize the sum of squared deviations of the n observations from the fitted regression,

$$\sum_{i=1}^{n}\Bigl(x_{1i} - \sum_{j=2}^{p}\beta_{1j.q_j}x_{ji}\Bigr)^2, \tag{27.40}$$

where we measure from the means of the x's, and assume n > p. The solution is, from (19.12),

$$\boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{x}_1, \tag{27.41}$$

where the matrix X refers to the observations on the (p − 1) variables $x_2, \ldots, x_p$, and $\mathbf{x}_1$ is the vector of observations on that variable. (27.41) may be written

$$\boldsymbol{\beta} = (n\mathbf{V}_{p-1})^{-1}(n\mathbf{M}) = \mathbf{V}_{p-1}^{-1}\mathbf{M}, \tag{27.42}$$

where $\mathbf{V}_{p-1}$ is the dispersion matrix of $x_2, \ldots, x_p$ and M is the vector of covariances of $x_1$ with $x_j$ (j = 2, ..., p). Thus

$$\beta_{1j.q_j} = \frac{1}{|\mathbf{V}_{p-1}|}\sum_{l=2}^{p}(\mathbf{V}_{p-1})_{jl}\,\sigma_{1l}. \tag{27.43}$$

Since $|\mathbf{V}_{p-1}|$ is the minor $V_{11}$ of the dispersion matrix V of all p variables, $(\mathbf{V}_{p-1})_{jl}$ is the complementary minor of

$$\begin{vmatrix} \sigma_1^2 & \sigma_{1l} \\ \sigma_{j1} & \sigma_{jl} \end{vmatrix}$$

in V, so that the sum on the right of (27.43) is the cofactor of $(-\sigma_{1j})$ in V. Thus (27.43) becomes

$$\beta_{1j.q_j} = -\frac{V_{1j}}{V_{11}} = -\frac{\sigma_1}{\sigma_j}\frac{C_{1j}}{C_{11}}. \tag{27.44}$$

(27.44) is identical with (27.19). Thus, as in 26.8, we reach the conclusion that the Least Squares approximation gives us the same regression coefficients as in the case of exact linearity of regression.
It follows that all the results of this chapter are valid when we fit Least Squares regressions to observed populations.
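
The identity of the Least Squares solution (27.42) with the cofactor form (27.44) can be seen in a small case. The following sketch is an added illustration under assumed values: `V` is a hypothetical dispersion matrix for three variates, the 2 × 2 normal equations are solved by Cramer's rule, and the cofactors $V_{11}$, $V_{12}$, $V_{13}$ are written out by hand.

```python
# Illustrative dispersion matrix V of (x1, x2, x3); the values are assumed.
V = [[4.0, 2.0, 1.0],
     [2.0, 9.0, 3.0],
     [1.0, 3.0, 16.0]]

# Least Squares solution (27.42): solve V_{p-1} beta = M by Cramer's rule,
# where V_{p-1} is the dispersion matrix of (x2, x3) and M = (sigma_12, sigma_13).
d = V[1][1] * V[2][2] - V[1][2] * V[2][1]
beta_ls = [(V[0][1] * V[2][2] - V[0][2] * V[1][2]) / d,
           (V[0][2] * V[1][1] - V[0][1] * V[2][1]) / d]

# Cofactor form (27.44): beta_{1j.q_j} = -V_{1j} / V_{11}.
V11 = d
V12 = -(V[1][0] * V[2][2] - V[1][2] * V[2][0])
V13 = V[1][0] * V[2][1] - V[1][1] * V[2][0]
beta_cof = [-V12 / V11, -V13 / V11]

print(beta_ls, beta_cof)
```

For larger p the same comparison would call for a general linear solver; the hand-written cofactors here are only meant to keep the correspondence with (27.43) and (27.44) visible.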


Sample coefficients 

27.16 If we are using a sample of n observations and fit regressions by Least Squares, all the relationships we have discussed will hold between the sample coefficients. Following our usual convention, we shall use r instead of $\rho$, b instead of $\beta$, and $s^2$ instead of $\sigma^2$ to distinguish the sample coefficients from their population equivalents. The b's are determined by minimizing the analogue of (27.40)

$$n s^2_{1.23\ldots p} = \sum_{i=1}^{n}\Bigl(x_{1i} - \sum_{j=2}^{p} b_{1j.q_j}x_{ji}\Bigr)^2, \tag{27.45}$$

and we have as at (27.21)

$$r^2_{1j.q_j} = b_{1j.q_j}\,b_{j1.q_j},$$

while the analogues of (27.34), (27.38) and (27.39) also hold.

If we equate to zero the derivatives of (27.45) with respect to the $b_{1j.q_j}$ (which is the method by which we determine the $b_{1j.q_j}$), we have the (p − 1) equations

$$\sum_{i=1}^{n} x_{li}\Bigl(x_{1i} - \sum_{j=2}^{p} b_{1j.q_j}x_{ji}\Bigr) = 0, \qquad l = 2, 3, \ldots, p,$$

which we may write

$$\sum x_l\,x_{1.23\ldots p} = 0, \qquad l = 2, 3, \ldots, p, \tag{27.46}$$

the summation being over the n observations. $x_{1.23\ldots p}$ is the residual from the fitted regression (cf. 27.9). From (27.46) it follows that

$$\sum x^2_{1.q} = \sum x_{1.q}\Bigl(x_1 - \sum_j b_{1j.q_j}x_j\Bigr) = \sum x_{1.q}\,x_1, \tag{27.47}$$

and similarly

$$\sum x_{1.r}\,x_{2.r} = \sum x_{1.r}\,x_2 = \sum x_1\,x_{2.r}, \tag{27.48}$$

where r is any common set of secondary subscripts. Relations like (27.47) and (27.48) hold for the population errors as well as the sample residuals, but we shall find them of use mainly in sampling problems, which is why we have expressed them in terms of residuals. Exercise 27.5 gives the most general rule for the omission of common secondary subscripts in summing products of residuals.
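
The orthogonality relations (27.46) and (27.47) are easily verified on a small sample. The sketch below is an added illustration: the six observations on three variates are invented, the helper names `centred` and `dot` are hypothetical, and the two normal equations are solved directly by Cramer's rule rather than through (27.41).

```python
def centred(xs):
    """Measure a list of observations from its mean."""
    m = sum(xs) / len(xs)
    return [x - m for x in xs]

def dot(a, b):
    """Sum of products over the n observations."""
    return sum(u * v for u, v in zip(a, b))

# Invented sample of n = 6 observations on x1, x2, x3.
x1 = centred([2.0, 4.0, 5.0, 4.0, 5.0, 7.0])
x2 = centred([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x3 = centred([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])

# Normal equations for b_{12.3} and b_{13.2}, solved by Cramer's rule.
s22, s33, s23 = dot(x2, x2), dot(x3, x3), dot(x2, x3)
s12, s13 = dot(x1, x2), dot(x1, x3)
d = s22 * s33 - s23 ** 2
b12_3 = (s12 * s33 - s13 * s23) / d
b13_2 = (s13 * s22 - s12 * s23) / d

# Residuals x_{1.23} from the fitted regression.
resid = [a - b12_3 * b - b13_2 * c for a, b, c in zip(x1, x2, x3)]

print(dot(x2, resid), dot(x3, resid))    # (27.46): both vanish up to rounding
print(dot(resid, resid), dot(resid, x1)) # (27.47): the two sums agree
```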


Estimation of population coefficients 

27.17 As for the zero-order correlations and regressions of the previous chapter, 
we may use the sample coefficients as estimators of their population equivalents. If 
the regression concerned is linear, we know from the Least Squares theory in Chapter 19 
that any b is an unbiassed estimator of the corresponding $\beta$ and that $\frac{n}{n-p}\,s^2_{1.23\ldots p}$
is an unbiassed estimator of $\sigma^2_{1.23\ldots p}$. However, no r is an unbiassed estimator of its $\rho$:
we saw in 26.15-17 that even for a zero-order coefficient in the normal case, r is not
unbiassed for $\rho$, but that the modification (26.34) or (26.35) is an unbiassed estimator.
A result to be obtained in 27.22 will enable us to estimate any partial correlation 
coefficient analogously in the normal case. 


Geometrical interpretation of partial correlation 
27.18 From our results, it is clear that the whole complex of partial regressions, 
correlations and variances or covariances of errors or residuals is completely determined 
by the variances and correlations, or by the variances and regressions, of zero order. 
It is interesting to consider this result from the geometrical point of view. 
Suppose in fact that we have n observations on p (< n) variates

$$x_{11}, \ldots, x_{1p};\quad x_{21}, \ldots, x_{2p};\quad \ldots;\quad x_{n1}, \ldots, x_{np}.$$

Consider a (Euclidean) sample space of n dimensions. To the observations $x_{1k}, \ldots, x_{nk}$ on the kth variate, there will correspond one point in this space, and there are p such points, one for each variate. Call these points $Q_1, Q_2, \ldots, Q_p$. We will assume that the x's are measured about their means, and take the origin to be P.

The quantity $n\sigma_l^2$ may then be interpreted as the square of the length of the vector joining the point $Q_l$ (with co-ordinates $x_{1l}, \ldots, x_{nl}$) to P. Similarly $\rho_{lm}$ may be interpreted as the cosine of the angle $Q_l P Q_m$, for

$$\rho_{lm} = \frac{\sum_{j=1}^{n} x_{jl}x_{jm}}{\Bigl(\sum_{j=1}^{n} x_{jl}^2 \sum_{j=1}^{n} x_{jm}^2\Bigr)^{1/2}},$$

which is the formula for the cosine of the angle between $PQ_l$ and $PQ_m$.
Our result may then be expressed by saying that all the relations connecting the p points in the n-space are expressible in terms of the lengths of the vectors $PQ_i$ and of the angles between them; and the theory of partial correlation and regression is thus exhibited as formally identical with the trigonometry of an n-dimensional constellation of points.

27.19 The reader who prefers the geometrical way of looking at this branch of the subject will have no difficulty in translating the foregoing equations into trigonometrical terminology. We will indicate only the more important results required for later sampling investigations.

Note in the first place that the p points $Q_i$ and the point P determine (except perhaps in degenerate cases) a sub-space of p dimensions in the n-space. Consider the point $Q_{1.2\ldots p}$ whose co-ordinates are the n residuals $x_{1.2\ldots p}$. In virtue of (27.46) the vector $PQ_{1.2\ldots p}$ is orthogonal to each of the vectors $PQ_2, \ldots, PQ_p$ and hence to the space of (p − 1) dimensions spanned by $P, Q_2, \ldots, Q_p$.

Consider now the residual vectors $Q_{1.r}$, $Q_{2.r}$, where r represents the secondary subscripts 3, 4, ..., (p − 1). The cosine of the angle between them, say $\theta$, is $\rho_{12.r}$, and each is orthogonal to the space spanned by $P, Q_3, \ldots, Q_{(p-1)}$. In Fig. 27.1, let M be the foot of the perpendicular from $Q_{1.r}$ on to $PQ_{p.r}$ and $Q'_{2.r}$ a point on $PQ_{2.r}$ such that $Q'_{2.r}M$ is also perpendicular to $PQ_{p.r}$. Then $MQ_{1.r}$ and $MQ'_{2.r}$ are orthogonal to the space spanned by $P, Q_3, \ldots, Q_p$, and the cosine of the angle between them, say $\phi$, is $\rho_{12.rp}$. Thus, to express $\rho_{12.rp}$ in terms of $\rho_{12.r}$, we have to express $\phi$ in terms of $\theta$, or the angle between the vectors $PQ_{1.r}$ and $PQ'_{2.r}$ in terms of that between their projections on the hyperplane perpendicular to $PQ_{p.r}$. We now drop the prime in $Q'_{2.r}$ for convenience. By Pythagoras' theorem,

$$\begin{aligned}
(Q_{1.r}Q_{2.r})^2 &= PQ_{1.r}^2 + PQ_{2.r}^2 - 2\,PQ_{1.r}\cdot PQ_{2.r}\cos\theta \\
&= MQ_{1.r}^2 + MQ_{2.r}^2 - 2\,MQ_{1.r}\cdot MQ_{2.r}\cos\phi.
\end{aligned}$$

Further,

$$PQ_{1.r}^2 = PM^2 + MQ_{1.r}^2$$

and

$$PQ_{2.r}^2 = PM^2 + MQ_{2.r}^2,$$

and hence we find

$$MQ_{1.r}\,MQ_{2.r}\cos\phi = PQ_{1.r}\,PQ_{2.r}\cos\theta - PM^2$$

or

$$\frac{MQ_{1.r}}{PQ_{1.r}}\cdot\frac{MQ_{2.r}}{PQ_{2.r}}\cos\phi = \cos\theta - \frac{PM}{PQ_{1.r}}\cdot\frac{PM}{PQ_{2.r}}. \tag{27.49}$$

Now $MQ_{1.r}/PQ_{1.r}$ and $PM/PQ_{1.r}$ are the sine and cosine of the angle between $PQ_{1.r}$ and $PQ_{p.r}$. Since $PQ_{1.r}$ is orthogonal to the space spanned by $P, Q_3, \ldots, Q_{p-1}$, its angle with $PQ_p$ is unchanged if the latter is projected orthogonally to that space, i.e. if we replace $PQ_p$ by $PQ_{p.r}$. The cosine of the angle between $PQ_{1.r}$ and $PQ_{p.r}$ is $\rho_{1p.r}$, and hence

$$\frac{PM}{PQ_{1.r}} = \rho_{1p.r}, \qquad \frac{MQ_{1.r}}{PQ_{1.r}} = (1 - \rho^2_{1p.r})^{1/2}.$$

The same result holds with the suffix 2 replacing 1. Thus, substituting in (27.49),

$$\rho_{12.rp} = \frac{\rho_{12.r} - \rho_{1p.r}\rho_{2p.r}}{\{(1-\rho^2_{1p.r})(1-\rho^2_{2p.r})\}^{1/2}}, \tag{27.50}$$

which is (27.39) again. We thus see that the expression of a partial correlation in terms of that of next lower order may be represented as the projection of an angle in the sample space on to a subspace orthogonal to the variable held fixed in the higher-order coefficient alone.

Fig. 27.1. The geometry of partial correlation
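
The geometrical statement can be tried out on a small sample: the partial correlation $r_{12.3}$ should equal the cosine of the angle between the parts of the vectors for $x_1$ and $x_2$ that are orthogonal to the vector for $x_3$. The sketch below is an added illustration with invented data; the helper names `centred`, `dot`, `residual` and `cosine` are hypothetical.

```python
import math

def centred(xs):
    """Measure a list of observations from its mean."""
    m = sum(xs) / len(xs)
    return [x - m for x in xs]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def residual(a, c):
    """Component of the vector a orthogonal to the vector c."""
    k = dot(a, c) / dot(c, c)
    return [u - k * v for u, v in zip(a, c)]

def cosine(a, b):
    """Cosine of the angle between two vectors from the origin P."""
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

# Invented sample of n = 6 observations on x1, x2, x3, measured from their means.
x1 = centred([2.0, 4.0, 5.0, 4.0, 5.0, 7.0])
x2 = centred([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x3 = centred([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])

# Zero-order sample correlations as cosines of angles.
r12, r13, r23 = cosine(x1, x2), cosine(x1, x3), cosine(x2, x3)

# Partial correlation as the cosine of the angle between the projected vectors,
# and from the analogue of (27.5).
from_angles = cosine(residual(x1, x3), residual(x2, x3))
from_formula = (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))
print(from_angles, from_formula)
```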


Computation of coefficients 

27.20 Where there are only 3 or 4 variates, we may proceed to calculate the partial 
correlations and regressions directly from the zero-order coefficients, using the appro- 
priate one of the formulae we have derived. When larger numbers of variables are 
involved, it is as well to systematize the arithmetic in determinantal form. In effect, 
we need to evaluate all the minors of C, the correlation matrix, and then formulae 
(27.6), (27.19) and (27.26) applied to them give us the correlation and regression 
coefficients and residual (or error) variances of all orders. Now that electronic computing facilities are becoming widely available, the tedium of manual calculation can be avoided.

For p small, tables of quantities such as $1-\rho^2$, $(1-\rho^2)^{1/2}$ and $\{(1-\rho_1^2)(1-\rho_2^2)\}^{-1/2}$ are useful. Trigonometrical tables are also useful; for instance, given $\rho$ we can find $\theta = \arccos\rho$ and hence $\sin\theta = (1-\rho^2)^{1/2}$, $\operatorname{cosec}\theta = (1-\rho^2)^{-1/2}$, and so on.

The Kelley Statistical Tables (Harvard U.P., 1948) give $(1-\rho^2)^{1/2}$ for

$\rho$ = 0.0001 (0.0001) 0.9999.

The two examples which follow are of interpretational, as well as computational, 

interest. 


Example 27.1 


In an investigation into the relationship between weather and crops, Hooker (1907) found the following means, standard deviations and correlations between the yields of “seeds' hay” ($x_1$) in cwt per acre, the spring rainfall ($x_2$) in inches, and the accumulated temperature above 42° F in the spring ($x_3$) for an English area over 20 years:

$$\begin{array}{lll}
\mu_1 = 28.02, & \sigma_1 = 4.42, & \rho_{12} = +0.80, \\
\mu_2 = 4.91, & \sigma_2 = 1.10, & \rho_{13} = -0.40, \\
\mu_3 = 594, & \sigma_3 = 85, & \rho_{23} = -0.56.
\end{array}$$

The question of primary interest here is the influence of weather on crop yields, and we consider only the regression of $x_1$ on the other two variates. From the correlations of zero order, it appears that yield and rainfall are positively correlated but that yield and accumulated spring temperature are negatively correlated. The question is, what interpretation is to be placed on this latter result? Does high temperature adversely affect yields or may the negative correlation be due to the fact that high temperature involves less rain, so that the beneficial effect of warmth is more than offset by the harmful effect of drought?

To throw some light on this question, let us calculate the partial correlations. From (27.5) we have

$$\rho_{12.3} = \frac{\rho_{12} - \rho_{13}\rho_{23}}{\{(1-\rho_{13}^2)(1-\rho_{23}^2)\}^{1/2}} = \frac{0.80 - (-0.40)(-0.56)}{\{(1-0.40^2)(1-0.56^2)\}^{1/2}} = 0.759.$$

Similarly

$$\rho_{13.2} = 0.097, \qquad \rho_{23.1} = -0.436.$$

We next require the regressions and the error variances. We have

$$\beta_{12.3} = \frac{\operatorname{cov}(x_{1.3}, x_{2.3})}{\operatorname{var} x_{2.3}} = \rho_{12.3}\,\frac{\sigma_{1.3}}{\sigma_{2.3}}.$$

This, however, involves the calculation of $\sigma_{1.3}$ and $\sigma_{2.3}$, which are not in themselves of interest. We can avoid these calculations by noting from (27.33) that

$$\left.\begin{aligned}
\sigma_{1.23} &= \sigma_{1.3}(1-\rho^2_{12.3})^{1/2}, \\
\sigma_{2.13} &= \sigma_{2.3}(1-\rho^2_{12.3})^{1/2},
\end{aligned}\right\} \tag{27.51}$$

so that

$$\beta_{12.3} = \rho_{12.3}\,\frac{\sigma_{1.23}}{\sigma_{2.13}}. \tag{27.52}$$

The standard deviations $\sigma_{1.23}$ and $\sigma_{2.13}$ are of some interest and may be calculated from (27.33). We have

$$\sigma_{1.23} = \sigma_1\{(1-\rho_{12}^2)(1-\rho^2_{13.2})\}^{1/2} = \sigma_1\{(1-\rho_{13}^2)(1-\rho^2_{12.3})\}^{1/2},$$

the two forms offering a check on each other. From the first we have

$$\sigma_{1.23} = 4.42\,\{(1-0.8^2)(1-0.097^2)\}^{1/2} = 2.64.$$

Similarly

$$\sigma_{2.13} = 0.594, \qquad \sigma_{3.12} = 70.1.$$

Thus

$$\beta_{12.3} = 0.759 \times \frac{2.64}{0.594} = 3.37,$$

and we also find

$$\beta_{13.2} = 0.00364.$$

The regression equation of $x_1$ on $x_2$ and $x_3$ is then

$$x_1 - 28.02 = 3.37\,(x_2 - 4.91) + 0.00364\,(x_3 - 594).$$

This equation shows that for increasing rainfall the yield increases, and that for increasing temperature the yield also increases, other things being equal. It enables us to isolate the effects of rainfall from those of temperature and to study each separately. The fact that $\beta_{13.2}$ is positive means that there is a positive relation between yield and temperature when the effect of rainfall is eliminated. The partial correlations tell the same story. Although $\rho_{13}$ is negative, $\rho_{13.2}$ is positive (though small), indicating that the negative value of $\rho_{13}$ is due to complications introduced by the rainfall factor.

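The foregoing procedure avoids the use of determinantal arithmetic, but the latter may be used if preferred. (27.2) is

$$|C| = \begin{vmatrix} 1 & 0.80 & -0.40 \\ 0.80 & 1 & -0.56 \\ -0.40 & -0.56 & 1 \end{vmatrix} = 0.2448,$$

$$C_{11} = \begin{vmatrix} 1 & -0.56 \\ -0.56 & 1 \end{vmatrix} = 0.6864,$$

from which, for example, by (27.34),

$$\sigma_{1.23} = \sigma_1\Bigl(\frac{|C|}{C_{11}}\Bigr)^{1/2} = 2.64, \text{ as before.}$$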
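
The arithmetic of Example 27.1 can be retraced in a few lines. The sketch below is an added illustration rather than part of Hooker's or the authors' working: the function name `partial` is hypothetical, and $\sigma_3$ is taken as 85, the value consistent with $\sigma_{3.12} = 70.1$ quoted in the example.

```python
import math

def partial(r_ab, r_ac, r_bc):
    """First-order partial correlation of a and b with c fixed, as in (27.5)."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Standard deviations and zero-order correlations of the example
# (sigma_3 = 85 is an assumed reading, see the lead-in).
s1, s2, s3 = 4.42, 1.10, 85.0
r12, r13, r23 = 0.80, -0.40, -0.56

r12_3 = partial(r12, r13, r23)
r13_2 = partial(r13, r12, r23)
r23_1 = partial(r23, r12, r13)

# Error standard deviations from (27.33).
s1_23 = s1 * math.sqrt((1 - r12 ** 2) * (1 - r13_2 ** 2))
s2_13 = s2 * math.sqrt((1 - r12 ** 2) * (1 - r23_1 ** 2))
s3_12 = s3 * math.sqrt((1 - r13 ** 2) * (1 - r23_1 ** 2))

# Partial regression coefficients from (27.52) and its analogue for x3.
b12_3 = r12_3 * s1_23 / s2_13
b13_2 = r13_2 * s1_23 / s3_12

print(round(r12_3, 3), round(r13_2, 3), round(r23_1, 3))
print(round(s1_23, 2), round(s2_13, 3), round(s3_12, 1))
print(round(b12_3, 2), round(b13_2, 5))
```

The alternative form $\sigma_1\{(1-\rho_{13}^2)(1-\rho_{12.3}^2)\}^{1/2}$ gives the same $\sigma_{1.23}$ and serves as the check mentioned in the example.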


Example 27.2 


In some investigations into the variation of crime in 16 large cities in the U.S.A., Ogburn (1935) found a correlation of −0.14 between crime rate ($x_1$) as measured by the number of known offences per thousand inhabitants and church membership ($x_5$) as measured by the number of church members of 13 years of age or over per 100 of total population of 13 years of age or over. The obvious inference is that religious belief acts as a deterrent to crime. Let us consider this more closely.

If $x_2$ = percentage of male inhabitants,
$x_3$ = percentage of total inhabitants who are foreign-born males, and
$x_4$ = number of children under 5 years old per 1000 married women between 15 and 44 years old,

Ogburn finds the values

$$\begin{array}{ll}
\rho_{12} = +0.44, & \rho_{24} = -0.19, \\
\rho_{13} = -0.34, & \rho_{25} = -0.35, \\
\rho_{14} = -0.31, & \rho_{34} = +0.44, \\
\rho_{15} = -0.14, & \rho_{35} = +0.33, \\
\rho_{23} = +0.25, & \rho_{45} = +0.85.
\end{array}$$

From these and other data given in his paper it may be shown that we have, for the regression of $x_1$ on the other four variates,

$$x_1 - 19.9 = 4.51\,(x_2 - 49.2) - 0.88\,(x_3 - 30.2) - 0.072\,(x_4 - 481.4) + 0.63\,(x_5 - 41.6),$$

and for certain partial correlations

$$\rho_{15.3} = -0.03, \qquad \rho_{15.4} = +0.25, \qquad \rho_{15.34} = +0.23.$$

Now we note from the regression equation that when the other factors are constant $x_1$ and $x_5$ are positively related, i.e. church membership appears to be positively associated with crime. How does this effect come to be masked so as to give a negative correlation in the coefficient of zero order $\rho_{15}$?

We note in the first place that the correlation between crime and church membership 
when the effect of x3, the percentage of foreigners, is eliminated, is near zero. The 
correlation when $x_4$, the number of young children, is eliminated, is positive; and
the correlation when both $x_3$ and $x_4$ are eliminated is again positive. It appears, in
fact, from the regression equation that a high percentage of foreigners and a high 
proportion of children are negatively associated with the crime-rate. Now both these 
factors are positively correlated with church membership (foreign immigrants being 
mainly Catholic and more fecund). These correlations submerge the positive associa- 
tion with crime of church membership among other members of the population. The 
apparently negative association of church membership with crime appears to be due 
to the more law-abiding spirit of the foreign immigrants and the fact that they are also 
more zealous churchmen. 

The reader may care to refer to Ogburn’s paper for a more complete discussion. 
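
The three partial correlations quoted in Example 27.2 follow from the zero-order values by (27.5) and one further application of (27.39). The sketch below is an added illustration: the function name `partial` is hypothetical, and $\rho_{34}$ is taken as +0.44, an assumed reading that reproduces the quoted $\rho_{15.34} = +0.23$.

```python
import math

def partial(r_ab, r_ac, r_bc):
    """Partial correlation of a and b with c fixed; inputs may themselves be partials."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Zero-order correlations involving x1, x3, x4, x5 (rho_34 = 0.44 is assumed).
r13, r14, r15 = -0.34, -0.31, -0.14
r34, r35, r45 = 0.44, 0.33, 0.85

# First-order coefficients from (27.5).
r15_3 = partial(r15, r13, r35)
r15_4 = partial(r15, r14, r45)

# Second-order coefficient from (27.39): adjoin the suffix 4, then fix x3.
r13_4 = partial(r13, r14, r34)
r35_4 = partial(r35, r34, r45)
r15_34 = partial(r15_4, r13_4, r35_4)

print(round(r15_3, 2), round(r15_4, 2), round(r15_34, 2))
```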


Sampling distributions of partial correlation and regression coefficients in the 
normal case 
27.21 We now consider the sampling distributions of the partial correlation and 
regression coefficients in the normal case. 
For large samples, the standard errors appropriate to zero-order coefficients (cf. 26.13) may be used with obvious adjustments. Writing m for a set of secondary subscripts, we have, from (26.24),

$$\operatorname{var} r_{12.m} = \frac{1}{n}(1-\rho^2_{12.m})^2, \tag{27.53}$$

and from (26.30)

$$\operatorname{var} b_{12.m} = \frac{1}{n}\frac{\sigma^2_{1.m}}{\sigma^2_{2.m}}(1-\rho^2_{12.m}) = \frac{1}{n}\frac{\sigma^2_{1.2m}}{\sigma^2_{2.m}}, \tag{27.54}$$

by (27.32). The proof of (27.53) and (27.54) by the direct methods of Chapter 10 is very tedious. They follow directly, however, from noting that the joint distribution of any two errors $x_{1.m}$ and $x_{2.m}$ is bivariate normal with correlation coefficient $\rho_{12.m}$. It follows, as Yule (1907) pointed out, that the sample correlation and regressions between the corresponding residuals have at least the large-sample distribution of a zero-order coefficient. We shall see in 27.22 that in a sample of size n, the exact distribution of $r_{12.m}$ is that of a zero-order correlation based on (n − d) observations, where d is the number of secondary subscripts in m. However, since (27.53) and (27.54) are correct only to order $n^{-1}$, we need not adjust them by this small factor.


27.22 Consider now the geometrical representation of 27.18-19. Suppose that we have three vectors $PQ_1$, $PQ_2$, $PQ_3$ representing n observations on $x_1$, $x_2$, $x_3$. As we saw in 27.19, the partial correlation $r_{12.3}$ is the cosine of the angle between $PQ_1$ and $PQ_2$ projected on to the subspace orthogonal to $PQ_3$, which is of dimension (n − 1). If we make an orthogonal transformation (i.e. a rotation of the co-ordinate axes), the correlations, being cosines of angles, are unaffected; moreover, if the n original observations on the three variables are independent of each other, the n observations on the orthogonally transformed variables will also be. (This is a generalization of the result of Examples 11.2 and 11.3 and of 15.27 for independent $x_1$, $x_2$, $x_3$, and its proof is left for the reader as Exercise 27.7; it is geometrically obvious from the radial symmetry of the standardized multinormal distribution.) If $PQ_3$ is taken as one of the new co-ordinate axes in the orthogonal transformation, the distribution of $r_{12.3}$ is at once seen to be the same as that of a zero-order coefficient based on (n − 1) independent observations. By repeated application of this argument, it follows that the distribution of a correlation coefficient of order d based on n observations is that of a zero-order coefficient based on (n − d) observations: each secondary subscript involves a projection in the sample space orthogonal to that variable and a loss of one degree of freedom. The result is due to Fisher (1924a).

The results of the previous chapter are thus immediately applicable to partial correlations, with this adjustment. If d is small compared with n, the distribution of partial correlations as n increases is effectively the same as that of zero-order coefficients, confirming the approximation (27.53) to the standard error.

It also follows for partial regression coefficients that the zero-order coefficient distribution (16.86) persists when the set m of secondary subscripts is adjoined throughout, with n replaced by (n − d). In particular, the “Student's” distribution of (26.38) becomes, for the regression of $x_1$ on $x_2$, that of

$$t = (b_{12.m} - \beta_{12.m})\Bigl\{\frac{(n-d-2)\,s^2_{2.m}}{s^2_{1.m} - b^2_{12.m}\,s^2_{2.m}}\Bigr\}^{1/2} \tag{27.55}$$

with (n − d − 2) degrees of freedom. If the set m consists of all (p − 2) other variates, there are (n − p) degrees of freedom. Since the regression coefficients are functions of distances (variances) as well as angles in the sample space, the distribution of $b_{12.m}$ itself, unlike that of r, is not directly preserved under projection with only degrees of freedom being reduced; the statistics $s^2_{1.m}$, $s^2_{2.m}$ in (27.55) make the necessary “distance” adjustments for the projections.

The multiple correlation coefficient

27.23 The variance in the population of $x_1$ about its regression on the other variates
(27.18) is $\sigma^2_{1.2\ldots p}$, defined in 27.9. We now define the multiple correlation coefficient
$\mathbf{R}_{1(2\ldots p)}$(*) between $x_1$ and $x_2, \ldots, x_p$ by

$$1 - \mathbf{R}^2_{1(2\ldots p)} = \sigma^2_{1.2\ldots p}/\sigma^2_1. \tag{27.56}$$

From (27.56) and (27.34),

$$0 \leq \mathbf{R}^2 \leq 1.$$

We shall define $\mathbf{R}$ as the positive square root of $\mathbf{R}^2$: it is always non-negative. $\mathbf{R}$ is
evidently not symmetric in its subscripts, and it is, indeed, a measure of the dependence
of $x_1$ upon $x_2, \ldots, x_p$.

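To justify its name, we have to show that it is in fact a correlation coefficient. We
have, from 27.9,

$$\sigma^2_{1.2\ldots p} = E(x^2_{1.2\ldots p}), \tag{27.57}$$

and by the population analogue of (27.47),

$$E(x^2_{1.2\ldots p}) = E(x_1 x_{1.2\ldots p}). \tag{27.58}$$

(27.57) and (27.58) give, since $E(x_{1.2\ldots p}) = 0$,

$$\sigma^2_{1.2\ldots p} = \operatorname{var}(x_{1.2\ldots p}) = \operatorname{cov}(x_1, x_{1.2\ldots p}). \tag{27.59}$$

If we now consider the correlation between $x_1$ and its conditional expectation

$$E(x_1 \mid x_2, \ldots, x_p) = x_1 - x_{1.2\ldots p},$$

we find that this is

$$\frac{\operatorname{cov}(x_1, x_1 - x_{1.2\ldots p})}{\{\operatorname{var} x_1 \operatorname{var}(x_1 - x_{1.2\ldots p})\}^{1/2}} = \frac{\operatorname{var} x_1 - \operatorname{cov}(x_1, x_{1.2\ldots p})}{[\operatorname{var} x_1 \{\operatorname{var} x_1 + \operatorname{var} x_{1.2\ldots p} - 2\operatorname{cov}(x_1, x_{1.2\ldots p})\}]^{1/2}}$$

and using (27.59) this is

$$\frac{\sigma^2_1 - \sigma^2_{1.2\ldots p}}{\{\sigma^2_1(\sigma^2_1 - \sigma^2_{1.2\ldots p})\}^{1/2}} = \left\{\frac{\sigma^2_1 - \sigma^2_{1.2\ldots p}}{\sigma^2_1}\right\}^{1/2} = \mathbf{R}_{1(2\ldots p)} \tag{27.60}$$

by (27.56). Thus $\mathbf{R}_{1(2\ldots p)}$ is the ordinary product-moment correlation coefficient
between $x_1$ and the conditional expectation $E(x_1 \mid x_2, \ldots, x_p)$. Since the sum of squared
errors (and therefore their mean $\sigma^2_{1.2\ldots p}$) is minimized in finding the Least Squares
regression, which is identical with $E(x_1 \mid x_2, \ldots, x_p)$ (cf. 27.15), it follows from (27.60)
that $\mathbf{R}_{1(2\ldots p)}$ is the correlation between $x_1$ and the “best-fitting” linear combination
of $x_2, \ldots, x_p$. No other linear function of $x_2, \ldots, x_p$ will have greater correlation
with $x_1$.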


27.24 From (27.56) and (27.34), we have

$$1 - \mathbf{R}^2_{1(2\ldots p)} = \frac{|C|}{C_{11}} = (1-\rho^2_{12})(1-\rho^2_{13.2}) \cdots (1-\rho^2_{1p.23\ldots(p-1)}), \tag{27.61}$$

expressing the multiple correlation coefficient in terms of the correlation determinant
or of the partial correlations. Since permutation of the subscripts other than 1 is
allowed in (27.34), it follows at once from (27.61) that, since each factor on the right
is in the interval (0, 1),

$$1 - \mathbf{R}^2_{1(2\ldots p)} \leq 1 - \rho^2_{1j.s},$$

where $\rho_{1j.s}$ is any partial or zero-order coefficient having 1 as a primary subscript.
Thus

$$\mathbf{R}_{1(2\ldots p)} \geq |\rho_{1j.s}|; \tag{27.62}$$

the multiple correlation coefficient is no less in value than the absolute value of any
correlation coefficient with a common primary subscript. It follows that if $\mathbf{R}_{1(2\ldots p)} = 0$,
all the corresponding $\rho_{1j.s} = 0$ also, so that $x_1$ is completely uncorrelated with all the
other variables. On the other hand, if $\mathbf{R}_{1(2\ldots p)} = 1$, at least one $\rho_{1j.s}$ must be 1 also
to make the right-hand side of (27.61) equal to zero. In this case, (27.56) shows that
$\sigma^2_{1.2\ldots p} = 0$, so that all points in the distribution of $x_1$ lie on the regression line, and
$x_1$ is a strict linear function of $x_2, \ldots, x_p$.

(*) We use a bold-face $\mathbf{R}$ for the population coefficient, and will later use an ordinary capital
$R$ for the corresponding sample coefficient: we are reluctant to use the Greek capital for the
population coefficient, in accordance with our usual convention, because it resembles a capital
P, which might be confusing.

Thus $\mathbf{R}_{1(2\ldots p)}$ is a measure of the linear dependence of $x_1$ upon $x_2, \ldots, x_p$.
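
The relations of 27.24 can be checked numerically for three variates. The sketch below assumes the case $p = 3$, where $|C| = 1 - \rho^2_{12} - \rho^2_{13} - \rho^2_{23} + 2\rho_{12}\rho_{13}\rho_{23}$ and $C_{11} = 1 - \rho^2_{23}$; the correlation values are illustrative only and the helper names are not from the text.

```python
import math

def partial_corr(r_ab, r_ac, r_bc):
    """First-order partial correlation of a and b with c held fixed."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac**2) * (1 - r_bc**2))

def multiple_r2_three(r12, r13, r23):
    """Squared multiple correlation of x1 on (x2, x3) as 1 - |C|/C11."""
    det_c = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    c11 = 1 - r23**2
    return 1 - det_c / c11

r12, r13, r23 = 0.5, 0.3, 0.4            # illustrative zero-order correlations
r13_2 = partial_corr(r13, r12, r23)      # rho_{13.2}
r12_3 = partial_corr(r12, r13, r23)      # rho_{12.3}

r2_det = multiple_r2_three(r12, r13, r23)
r2_prod_a = 1 - (1 - r12**2) * (1 - r13_2**2)   # product form with x2 entered first
r2_prod_b = 1 - (1 - r13**2) * (1 - r12_3**2)   # subscripts 2 and 3 permuted
print(r2_det, r2_prod_a, r2_prod_b)
```

The three values agree, and the squared multiple correlation is no smaller than the square of any of the coefficients `r12`, `r13`, `r12_3`, `r13_2`, in line with (27.62).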


27.25 So far, we have considered the multiple correlation coefficient between $x_1$
and all the other variates, but we may evidently also consider the multiple correlation
of $x_1$ and any subset. Thus we define

$$1 - \mathbf{R}^2_{1(s)} = \frac{\sigma^2_{1.s}}{\sigma^2_1} \tag{27.63}$$

for any set of subscripts $s$. It now follows immediately from (27.34) that

$$\sigma^2_{1.s} \leq \sigma^2_{1.r}, \tag{27.64}$$

where $r$ is any subset of $s$: the error variance cannot be increased by the addition
of a further variate. We thus have, from (27.63) and (27.64), relations of the type

$$\mathbf{R}^2_{1(2)} \leq \mathbf{R}^2_{1(23)} \leq \mathbf{R}^2_{1(234)} \leq \cdots \leq \mathbf{R}^2_{1(2\ldots p)}, \tag{27.65}$$

expressing the fact that the multiple correlation coefficient can never be reduced by
adding to the set of variables upon which the dependence of $x_1$ is to be measured.

In the particular case $p = 2$, we have from (27.61)

$$\mathbf{R}^2_{1(2)} = \rho^2_{12}, \tag{27.66}$$

so that $\mathbf{R}_{1(2)}$ is the absolute value of the ordinary correlation coefficient between $x_1$
and $x_2$.


Geometrical interpretation of multiple correlation 

27.26 We may interpret $R_{1(2\ldots p)}$ in the geometrical terms of 27.18-19. Consider
first the interpretation of the Least Squares regression (27.18): by 27.23, it is that
linear function of the variables $x_2, \ldots, x_p$ which minimizes the sum of squares (27.40).
Thus we choose the vector $PV$ in the $(p-1)$-dimensional subspace spanned by $P$,
$Q_2, \ldots, Q_p$, which minimizes the distance $Q_1V$, i.e. which minimizes the angle between
$PQ_1$ and $PV$. By (27.60), $R_{1(2\ldots p)}$ is the cosine of this minimized angle. But this
means that $R_{1(2\ldots p)}$ is the cosine of the angle between $PQ_1$ and the $(p-1)$-dimensional
subspace itself, for otherwise the angle would not be minimized.

If $R_{1(2\ldots p)} = 0$, $PQ_1$ is orthogonal to the $(p-1)$-subspace so that $x_1$ is uncorrelated
with $x_2, \ldots, x_p$ and with any linear function of them. If, on the other hand,
$R_{1(2\ldots p)} = 1$, $PQ_1$ lies in the $(p-1)$-subspace, so that $x_1$ is a strict linear function of
$x_2, \ldots, x_p$. These are the results we obtained in 27.24.

We shall find this geometrical interpretation helpful in deriving the distribution of
the sample coefficient $R_{1(2\ldots p)}$ in the normal case. It is a direct generalization of the
representation used in 16.24 to obtain the distribution of the ordinary product-moment
correlation coefficient $r$ which, as we observed at (27.66), is essentially the signed value
of $R_{1(2)}$.


The screening of variables in investigatory work 


27.27 In new fields of research, a preliminary investigation of the relations between
variables often begins with the calculation of the zero-order correlations between all
possible pairs of variables, giving the correlation matrix C. If we are only interested
in “predicting” the value of one variable, $x_1$, from the others, it is tempting first to
calculate only the correlations of $x_1$ with the others, and to discard those variables with
which it has zero or very small correlations: this would perhaps be done as a means of
reducing the number of variables to a manageable figure. The next stage would be to
calculate the correlation matrix of the retained variables and the multiple correlations
of $x_1$ on combinations of the remaining variables.

Unfortunately, this procedure may be seriously misleading. Whilst it is perfectly 
true that the whole set of zero-order correlations completely determine the whole com- 
plex of partial correlations, it is not true that small zero-order coefficients of x, with 
other variables guarantee small higher-order coefficients, and this is so even if we ignore 
sampling considerations. Since by (27.62) the multiple correlation must be as great 
as the largest correlation of any order, we may be throwing away valuable information 
by the “screening” procedure described above.

Consider (27.5) again:

$$\rho_{12.3} = \frac{\rho_{12} - \rho_{13}\rho_{23}}{\{(1-\rho^2_{13})(1-\rho^2_{23})\}^{1/2}}. \tag{27.67}$$

If $\rho_{12}$ and $\rho_{13}$ or $\rho_{23}$ are zero, so is $\rho_{12.3}$: if $\rho_{13} = \rho_{23} = 0$, $\rho_{12.3} = \rho_{12}$. But $\rho_{12}$ and
$\rho_{13}$ can both be very small while $\rho_{23}$ is very large. In fact, suppose that $\rho_{13} = 0$. Then
(27.67) becomes

$$\rho_{12.3} = \rho_{12}/(1-\rho^2_{23})^{1/2}, \qquad \rho_{13} = 0. \tag{27.68}$$

If $\rho_{12}$ is very small and $\rho_{23}$ is very large, (27.68) can be large, too. To consider a
specific example, let

$$\rho_{12} = \cos\theta, \qquad \rho_{23} = \cos(\tfrac12\pi - \theta) = \sin\theta.$$

Then (27.68) becomes

$$\rho_{12.3} = 1.$$

A similar result occurs if we put

$$\rho_{23} = \cos(\tfrac12\pi + \theta),$$

for then $\rho^2_{23}$ is unchanged.

Now we may make $\cos\theta$ (or $\cos(\tfrac12\pi + \theta)$) as small as we like, say $\varepsilon$. Thus we have

$$\rho_{13} = 0, \qquad \rho_{12} = \varepsilon, \qquad \rho_{12.3} = 1. \tag{27.69}$$

Since the multiple correlation $\mathbf{R}_{1(23)} \geq |\rho_{12.3}|$ by (27.62), we have in this case

$$\mathbf{R}_{1(23)} = 1. \tag{27.70}$$

By (27.70), $x_1$ is a perfect linear function of $x_2$ and $x_3$, despite the values of the zero-order
coefficients in (27.69). We should clearly have been unwise to discard $x_2$ and $x_3$
as predictors of $x_1$ on the evidence of the zero-order correlations alone.
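
The screening example can be reproduced numerically. The following is a minimal sketch, assuming the three-variate formulae for the partial correlation and for $1 - |C|/C_{11}$; the angle of 89 degrees is an arbitrary illustrative choice that makes $\cos\theta$ small.

```python
import math

def partial_corr(r_ab, r_ac, r_bc):
    """First-order partial correlation of a and b with c held fixed."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac**2) * (1 - r_bc**2))

def multiple_r2_three(r12, r13, r23):
    """Squared multiple correlation of x1 on (x2, x3) as 1 - |C|/C11."""
    det_c = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    return 1 - det_c / (1 - r23**2)

theta = math.radians(89.0)     # theta near pi/2, so cos(theta) is small
rho12 = math.cos(theta)        # small zero-order correlation (about 0.017)
rho13 = 0.0                    # x1 and x3 uncorrelated
rho23 = math.sin(theta)        # x2 and x3 very highly correlated

rho12_3 = partial_corr(rho12, rho13, rho23)
r2 = multiple_r2_three(rho12, rho13, rho23)
print(rho12, rho12_3, r2)
```

Both zero-order correlations of $x_1$ are negligible, yet the partial correlation and the squared multiple correlation are equal to 1 up to rounding error.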

It is easy to see what has happened here in geometrical terms. The vector $PQ_1$
is orthogonal to $PQ_3$ and almost orthogonal (making an angle $\theta$ near $\tfrac12\pi$) to $PQ_2$, but
all three vectors lie in the same plane, in which $PQ_2$ and $PQ_3$ are either at an angle
$(\tfrac12\pi - \theta)$ to each other (when $\cos(\tfrac12\pi - \theta)$ is very near 1) or at an angle $(\tfrac12\pi + \theta)$ to each
other (when $\cos(\tfrac12\pi + \theta)$ is very near $-1$).

We have been considering a simple example, but the same argument applies a fortiori
with more variables, where there is more room for relationships of this kind to appear.
The value of $\mathbf{R}$ depends on all the partial correlations.

Fortunately for human impatience, life has a habit of being less complicated than
it need be, and we usually escape the worst possible consequences of simplifying
procedures for the selection of “predictor” variables; we usually have enough background
knowledge, even in new fields, to help us to avoid the more egregious oversights,
but the logical difficulty remains.


The sample multiple correlation coefficient and its conditional distribution 
27.28 We now define the sample analogue of $\mathbf{R}^2_{1(2\ldots p)}$ by

$$1 - R^2_{1(2\ldots p)} = \frac{s^2_{1.2\ldots p}}{s^2_1} \tag{27.71}$$

and all the relations of 27.23-6 hold with the appropriate substitutions of $r$ for $\rho$, and
$s$ for $\sigma$. We proceed to discuss the sampling distribution of $R^2$ in detail. Since, by
27.23, it is a correlation coefficient, whose value is independent of location and scale,
its distribution will be free of location and scale parameters.

First, consider the conditional distribution of $R^2$ when the values of $x_2, \ldots, x_p$ are
fixed. As at (26.50) we write the identity

$$n s^2_1 = n s^2_1 R^2_{1(2\ldots p)} + n s^2_1 (1 - R^2_{1(2\ldots p)}) = n(s^2_1 - s^2_{1.2\ldots p}) + n s^2_{1.2\ldots p}, \tag{27.72}$$

by (27.71). If the observations on $x_1$ are independent standardized normal variates,
so that $\mathbf{R}^2_{1(2\ldots p)} = 0$, the left-hand side of (27.72) is distributed in the chi-squared form
with $(n-1)$ degrees of freedom, and the quadratic forms in $x_1$ on the right of (27.72)
may be shown to have ranks $(p-1)$ and $(n-p)$ respectively. It follows by Cochran’s
theorem (15.16) that they are independently distributed in the chi-squared form with
these degrees of freedom and that the ratio

$$F = \frac{R^2_{1(2\ldots p)}/(p-1)}{(1 - R^2_{1(2\ldots p)})/(n-p)} \tag{27.73}$$

has the F distribution with $(p-1, n-p)$ degrees of freedom, a result first given by
Fisher (1924b). (26.52) is the special case of (27.73) for $p = 2$, when $R^2_{1(2)} = r^2_{12}$
(cf. (27.66)).

This is another example of a LR test of a linear hypothesis. We postulate that
the mean of the observations of $x_1$ is a linear function of $(p-1)$ other variables, with
$(p-1)$ coefficients and a constant term, $p$ parameters in all. We test the hypothesis
that all $(p-1)$ coefficients are zero, i.e. $H_0$: $\mathbf{R}^2 = 0$. In the notation of 24.27-8,
we have $k = p$, $r = p-1$ so that the F-test (27.73) has $(p-1, n-p)$ degrees of freedom,
as we have seen. It follows immediately from 24.32-3 that when $H_0$ is not true,
F at (27.73) has a non-central F-distribution with degrees of freedom $p-1$ and $n-p$,
and non-central parameter $\lambda = n\mathbf{R}^2$, and the power properties of the LR test, given
in Chapter 24, apply here. In particular, the test is UMP invariant by 24.36-7.
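
As a small arithmetical illustration of (27.73), the sketch below converts a sample $R^2$ into the F statistic and back again. The function names and the numerical values are illustrative; the significance point of F would still have to be taken from tables or from a statistical library.

```python
def f_from_r2(r2, n, p):
    """F statistic of (27.73) for x1 on (p - 1) other variables, n observations."""
    return (r2 / (p - 1)) / ((1.0 - r2) / (n - p))

def r2_from_f(f, n, p):
    """Inverse of (27.73): the R^2 that corresponds to a given F."""
    return (p - 1) * f / ((p - 1) * f + (n - p))

# Illustrative values: R^2 = 0.5 with n = 23 observations on p = 3 variates.
f_stat = f_from_r2(0.5, 23, 3)          # degrees of freedom (2, 20)
r2_back = r2_from_f(f_stat, 23, 3)
print(f_stat, r2_back)
```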


The multinormal (unconditional) case 

27.29 If we now allow the values of $x_2, \ldots, x_p$ to vary also, and suppose that we
are sampling from a multinormal population, we find that the distribution of $R^2$ is
unchanged if $\mathbf{R}^2 = 0$, but quite different otherwise from that of $R^2$ with $x_2, \ldots, x_p$
fixed. Thus the power function of the test of $\mathbf{R}^2 = 0$ is different in the two cases,
although the same test is valid in each case. As $n \to \infty$, however, the results are
identical in both situations.

We derive the multinormal result for $\mathbf{R}^2 = 0$ geometrically, and proceed to generalize
it in 27.30.

Consider the geometrical representation of 27.26. $R$ is the cosine of the angle,
say $\theta$, between $PQ_1$ (the $x_1$-vector) and the vector $PV$, in the $(p-1)$-dimensional space
$S_{p-1}$ of the other variables, which makes the minimum angle with $PQ_1$. If the parent
$\mathbf{R} = 0$, $x_1$ is, since the population is multinormal, independent of $x_2, \ldots, x_p$, and the
vector $PQ_1$ will then, because of the radial symmetry of the normal distribution, be
randomly directed with respect to $S_{p-1}$, which we may therefore regard as fixed in the
subsequent argument. (We therefore see how it is that the conditional and unconditional
results coincide when $\mathbf{R}^2 = 0$.)

We have to consider the relative probabilities with which different values of $\theta$
may arise. For fixed variance $s^2_1$, the probability density of the sample of $n$ observations
is constant upon the $(n-2)$-dimensional surface of an $(n-1)$-dimensional
hypersphere. If $\theta$ and $PV$ are fixed, $PQ_1$ is constrained to lie upon a hypersphere of
$(n-2)-(p-1) = (n-p-1)$ dimensions, whose content is proportional to $(\sin\theta)^{n-p-1}$
(cf. 16.24). Now consider what happens when $PV$ varies. $PV$ is free to vary within
$S_{p-1}$, where by radial symmetry it will be equiprobable on the $(p-2)$-dimensional surface
of a $(p-1)$-sphere. This surface has content proportional to $(\cos\theta)^{p-2}$. For fixed $\theta$,
therefore, we have the probability element $(\sin\theta)^{n-p-1}(\cos\theta)^{p-2}\,d\theta$. Putting $R = \cos\theta$,
and $d\theta \propto d(R^2)/\{R(1-R^2)^{1/2}\}$, we find for the distribution of $R^2$ the Beta distribution

$$dF \propto (R^2)^{\frac12(p-3)}(1-R^2)^{\frac12(n-p-2)}\, d(R^2), \qquad 0 \leq R^2 \leq 1. \tag{27.74}$$

The constant of integration is easily seen to be $1/B\{\tfrac12(p-1), \tfrac12(n-p)\}$. The transformation
(27.73) applied to (27.74) then gives us exactly the same F-distribution as
that derived for $x_2, \ldots, x_p$ fixed in 27.28. When $p = 2$, (27.74) reduces to (16.62),
which is expressed in terms of $dR$ rather than $d(R^2)$.
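
The constant of integration in (27.74) can be checked by crude numerical integration. The sketch below is an illustration only: it assumes the Beta form with parameters $\tfrac12(p-1)$ and $\tfrac12(n-p)$, uses a simple midpoint rule, and takes the arbitrary values $n = 20$, $p = 4$.

```python
import math

def beta_fn(a, b):
    """Beta function B(a, b) expressed through Gamma functions."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def null_r2_density(x, n, p):
    """Density of R^2 at x when the parent multiple correlation is zero."""
    a, b = 0.5 * (p - 1), 0.5 * (n - p)
    return x ** (a - 1) * (1 - x) ** (b - 1) / beta_fn(a, b)

def midpoint_integral(func, lo, hi, steps=20000):
    """Midpoint-rule approximation to the integral of func over (lo, hi)."""
    h = (hi - lo) / steps
    return h * sum(func(lo + (k + 0.5) * h) for k in range(steps))

n, p = 20, 4                      # illustrative sample size and number of variates
total = midpoint_integral(lambda x: null_r2_density(x, n, p), 0.0, 1.0)
mean = midpoint_integral(lambda x: x * null_r2_density(x, n, p), 0.0, 1.0)
print(total, mean, (p - 1) / (n - 1))
```

The density integrates to unity, and its mean agrees with the value $(p-1)/(n-1)$ of the mean of a Beta variate with these parameters.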


27.30 We now turn to the case when $\mathbf{R} \neq 0$. The distribution of $R$ in this case
was first given by Fisher (1928a) by a considerable development of the geometrical
argument of 27.29. We give a much simpler derivation due to Moran (1950).

We may write (27.61) for the sample coefficient as

$$1 - R^2_{1(2\ldots p)} = (1 - r^2_{12})(1 - T^2), \tag{27.75}$$

say, where $T$ is the multiple correlation coefficient between $x_{1.2}$ and $x_{3.2}, x_{4.2}, \ldots, x_{p.2}$.
Now $R_{1(2\ldots p)}$ and the distribution of $R_{1(2\ldots p)}$ are unaffected if we make an orthogonal
transformation of $x_2, \ldots, x_p$ so that $x_2$ itself is the linear function of $x_2, \ldots, x_p$ which
has maximum correlation with $x_1$ in the population, i.e. $\rho_{12} = \mathbf{R}_{1(2\ldots p)}$. It then follows
from (27.61) that

$$\rho_{13.2} = \rho_{14.23} = \cdots = \rho_{1p.23\ldots(p-1)} = 0, \tag{27.76}$$

and since subscripts other than 1 may be permuted in (27.61), it follows that all partial
coefficients of form $\rho_{1j.2s} = 0$. Thus $x_{1.2}$ is uncorrelated with (and since the variation
is normal, independent of) $x_{3.2}, x_{4.2}, \ldots, x_{p.2}$, and $T$ in (27.75) is distributed as a
multiple correlation coefficient, based on $(n-1)$ observations (since we lose one dimension
by projection for the residuals), between one variate and $(p-2)$ others, with the
parent $\mathbf{R} = 0$. Moreover, $T$ is distributed independently of $r_{12}$, for all the variates
$x_{j.2}$ are orthogonal to $x_2$ by (27.46). Thus the two factors on the right of (27.75) are
independently distributed. The distribution of $r_{12}$, say $f_1(r)$, is (16.60) with
$\rho = \mathbf{R}_{1(2\ldots p)}$, integrated for $\beta$ over its range, while that of $T^2$, say $f_2(T^2)$, is (27.74)
with $n$ and $p$ each reduced by 1. We therefore have from (27.75) the distribution
of $R^2$

$$dF = \int_{-R}^{R} \left\{ f_2\!\left(\frac{R^2 - r^2}{1 - r^2}\right) \frac{d(R^2)}{1 - r^2} \right\} f_1(r)\, dr, \tag{27.77}$$

which, dropping all suffixes for convenience, is

$$dF = \int_{-R}^{R} \frac{n-2}{\pi}(1-\mathbf{R}^2)^{\frac12(n-1)}(1-r^2)^{\frac12(n-4)} \left\{\int_0^\infty \frac{d\beta}{(\cosh\beta - \mathbf{R}r)^{n-1}}\right\} \times \frac{1}{B\{\tfrac12(p-2), \tfrac12(n-p)\}} \left(\frac{R^2 - r^2}{1 - r^2}\right)^{\frac12(p-4)} \left(\frac{1 - R^2}{1 - r^2}\right)^{\frac12(n-p-2)} \frac{d(R^2)}{1 - r^2}\, dr$$

$$= \frac{(n-2)}{\pi B\{\tfrac12(p-2), \tfrac12(n-p)\}} (1-\mathbf{R}^2)^{\frac12(n-1)} (1-R^2)^{\frac12(n-p-2)}\, d(R^2) \int_{-R}^{R} (R^2 - r^2)^{\frac12(p-4)} \left\{\int_0^\infty \frac{d\beta}{(\cosh\beta - \mathbf{R}r)^{n-1}}\right\} dr. \tag{27.78}$$

If in (27.78) we put $r = R\cos\psi$ and write the integral with respect to $\beta$ from $-\infty$ to $\infty$,
dividing by 2 to compensate for this, we obtain Fisher’s form of the distribution,

$$dF = \frac{\Gamma(\tfrac12 n)}{\pi\,\Gamma\{\tfrac12(p-2)\}\,\Gamma\{\tfrac12(n-p)\}} (1-\mathbf{R}^2)^{\frac12(n-1)} (R^2)^{\frac12(p-3)} (1-R^2)^{\frac12(n-p-2)}\, d(R^2) \int_0^\pi \sin^{p-3}\psi \left\{\int_{-\infty}^{\infty} \frac{d\beta}{(\cosh\beta - \mathbf{R}R\cos\psi)^{n-1}}\right\} d\psi. \tag{27.79}$$

27.31 The distribution (27.79) may be expressed as a hypergeometric function.
Expanding the integrand in a uniformly convergent series of powers of $\cos\psi$, it becomes,
since odd powers of $\cos\psi$ will vanish on integration from 0 to $\pi$,

$$\sum_{j=0}^{\infty} \binom{n+2j-2}{2j} \frac{\sin^{p-3}\psi \cos^{2j}\psi}{(\cosh\beta)^{n-1+2j}} (\mathbf{R}R)^{2j},$$

and since

$$\int_0^\pi \cos^{2j}\psi \sin^{p-3}\psi\, d\psi = B\{\tfrac12(p-2), \tfrac12(2j+1)\}$$

and

$$\int_{-\infty}^{\infty} \frac{d\beta}{(\cosh\beta)^{n-1+2j}} = B\{\tfrac12, \tfrac12(n+2j-1)\},$$

the integral in (27.79) becomes

$$\sum_{j=0}^{\infty} \binom{n+2j-2}{2j} B\{\tfrac12(p-2), \tfrac12(2j+1)\}\, B\{\tfrac12, \tfrac12(n+2j-1)\}\, (\mathbf{R}R)^{2j},$$

and on writing this out in terms of Gamma functions and simplifying, it becomes

$$\frac{\pi\,\Gamma\{\tfrac12(p-2)\}\,\Gamma\{\tfrac12(n-1)\}}{\Gamma(\tfrac12 n)\,\Gamma\{\tfrac12(p-1)\}}\, F\{\tfrac12(n-1), \tfrac12(n-1), \tfrac12(p-1), \mathbf{R}^2R^2\}. \tag{27.80}$$

Substituting (27.80) for the integral in (27.79), we obtain

$$dF = \frac{(R^2)^{\frac12(p-3)}(1-R^2)^{\frac12(n-p-2)}\, d(R^2)}{B\{\tfrac12(p-1), \tfrac12(n-p)\}} \cdot (1-\mathbf{R}^2)^{\frac12(n-1)}\, F\{\tfrac12(n-1), \tfrac12(n-1), \tfrac12(p-1), \mathbf{R}^2R^2\}. \tag{27.81}$$

This unconditional distribution should be compared with the conditional distribution
of $R^2$, easily obtained from the non-central F distribution in 27.28, given in Exercise
27.13. Exercise 27.14 shows that as $n \to \infty$, both yield a non-central $\chi^2$ distribution
for $nR^2$.

The first factor on the right of (27.81) is the distribution (27.74) when $\mathbf{R} = 0$, the
second factor then being unity. When $p = 2$, (27.81) is not so rapidly convergent a
series for $r^2$ as (16.66) is for $r$, and generally it converges slowly, for the first two arguments
in the hypergeometric function are $\tfrac12(n-1)$. In the search for a more rapidly
convergent expression, we are tempted to substitute for the integral with respect to
$\beta$ in (27.78) the expression (16.65), which is

$$\int_0^\infty \frac{d\beta}{(\cosh\beta - \mathbf{R}r)^{n-1}} = \frac{B(\tfrac12, n-1)}{2^{1/2}(1-\mathbf{R}r)^{n-\frac32}}\, F\{\tfrac12, \tfrac12, n-\tfrac12, \tfrac12(1+\mathbf{R}r)\},$$

and since $F(a, b, c, x) = (1-x)^{c-a-b}F(c-a, c-b, c, x)$ this is

$$= \frac{B(\tfrac12, n-1)}{2^{n-1}}\, F\{n-1, n-1, n-\tfrac12, \tfrac12(1+\mathbf{R}r)\}. \tag{27.82}$$

But when we substitute (27.82) into (27.78), the integration with respect to $r$ does
not seem to lead to any more tractable result than (27.81).

The moments and limiting distributions of $R^2$


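27.32 It may be shown (cf. Wishart (1931)) that the mean value of $R^2$ in the
multinormal case is

$$E(R^2) = 1 - \frac{n-p}{n-1}(1-\mathbf{R}^2)\, F\{1, 1, \tfrac12(n+1), \mathbf{R}^2\} = \mathbf{R}^2 + \frac{p-1}{n-1}(1-\mathbf{R}^2) - \frac{2(n-p)}{(n^2-1)}\mathbf{R}^2(1-\mathbf{R}^2) + O\!\left(\frac{1}{n^2}\right). \tag{27.83}$$

In particular, when $\mathbf{R}^2 = 0$, (27.83) reduces to

$$E(R^2 \mid \mathbf{R}^2 = 0) = \frac{p-1}{n-1}, \tag{27.84}$$

also obtainable directly from (27.74).

Similarly, the variance may be shown to be

$$\operatorname{var}(R^2) = \frac{(n-p)(n-p+2)}{(n^2-1)}(1-\mathbf{R}^2)^2\, F\{2, 2, \tfrac12(n+3), \mathbf{R}^2\} - \{E(R^2) - 1\}^2 \tag{27.85}$$

$$= \frac{(n-p)}{(n^2-1)(n-1)}(1-\mathbf{R}^2)^2 \left[ 2(p-1) + \frac{4\mathbf{R}^2\{(n-p)(n-1) + 4(p-1)\}}{(n+3)} + O(\mathbf{R}^4) \right]. \tag{27.86}$$

To the leading order in $n$, (27.86) gives

$$\operatorname{var}(R^2) = \frac{4\mathbf{R}^2(1-\mathbf{R}^2)^2(n-p)^2}{(n^2-1)(n+3)} + O\!\left(\frac{1}{n^2}\right), \tag{27.87}$$

so that if $\mathbf{R}^2 \neq 0$ we have

$$\operatorname{var}(R^2) \sim 4\mathbf{R}^2(1-\mathbf{R}^2)^2/n. \tag{27.88}$$

But if $\mathbf{R}^2 = 0$, (27.87) is of no use, and we return to (27.86), finding

$$\operatorname{var}(R^2) = \frac{2(n-p)(p-1)}{(n^2-1)(n-1)} \sim 2(p-1)/n^2, \tag{27.89}$$

the exact result in (27.89) being obtainable from (27.74).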
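
The moment formulae may be examined numerically. The following sketch assumes that (27.83) and (27.85) have the hypergeometric forms $1 - \frac{n-p}{n-1}(1-\mathbf{R}^2)F\{1,1,\tfrac12(n+1),\mathbf{R}^2\}$ and $\frac{(n-p)(n-p+2)}{n^2-1}(1-\mathbf{R}^2)^2F\{2,2,\tfrac12(n+3),\mathbf{R}^2\} - \{E(R^2)-1\}^2$, and sums the hypergeometric series directly; the function names and the trial values of $n$, $p$ and $\mathbf{R}^2$ are illustrative.

```python
def hyp2f1_series(a, b, c, x, terms=400):
    """Gauss hypergeometric series F(a, b, c, x), summed term by term (|x| < 1)."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= (a + k) * (b + k) / ((c + k) * (k + 1)) * x
    return total

def mean_r2(rho2, n, p):
    """E(R^2) in the multinormal case, hypergeometric form of (27.83)."""
    return 1 - (n - p) / (n - 1) * (1 - rho2) * hyp2f1_series(1, 1, 0.5 * (n + 1), rho2)

def mean_r2_approx(rho2, n, p):
    """Expansion of E(R^2), neglecting terms of order 1/n^2."""
    return rho2 + (p - 1) / (n - 1) * (1 - rho2) - 2 * (n - p) / (n * n - 1) * rho2 * (1 - rho2)

def var_r2(rho2, n, p):
    """var(R^2) as E{(1 - R^2)^2} - {E(1 - R^2)}^2, following (27.85)."""
    second = ((n - p) * (n - p + 2) / (n * n - 1) * (1 - rho2) ** 2
              * hyp2f1_series(2, 2, 0.5 * (n + 3), rho2))
    return second - (mean_r2(rho2, n, p) - 1) ** 2

print(mean_r2(0.0, 30, 4), 3 / 29)                       # null case, (27.84)
print(mean_r2(0.5, 50, 4), mean_r2_approx(0.5, 50, 4))   # exact form against expansion
print(var_r2(0.0, 30, 4), 2 * 26 * 3 / (899 * 29))       # null case, (27.89)
print(var_r2(0.5, 200, 3), 4 * 0.5 * 0.25 / 200)         # non-null case against (27.88)
```

In the null case the hypergeometric factors are unity and the exact values (27.84) and (27.89) are recovered; in the non-null case the asymptotic forms are close for moderately large $n$.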


27.33 The different orders of magnitude of the asymptotic variances (27.88) and
(27.89) when $\mathbf{R} \neq 0$ and $\mathbf{R} = 0$ reflect the fundamentally different behaviour of the
distribution of $R^2$ in the two circumstances. Although (27.84) shows that $R^2$ is a
biassed estimator of $\mathbf{R}^2$, it is clearly consistent; for large $n$, $E(R^2) \to \mathbf{R}^2$ and
$\operatorname{var}(R^2) \to 0$. When $\mathbf{R} \neq 0$, the distribution of $R^2$ is asymptotically normal with mean
$\mathbf{R}^2$ and variance given by (27.88) (cf. Exercise 27.15). When $\mathbf{R} = 0$, however, $R^2$,
which is confined to the interval (0, 1), is converging to the value 0 at the lower extreme
of its range, and this alone is enough to show that its distribution is not normal in
this case (cf. Exercises 27.14-15). It is no surprise in these circumstances that its
variance is of order $n^{-2}$: the situation is analogous to the estimation of a terminal of a
distribution with finite range, where we saw in Exercises 14.8, 14.13, 14.16 that variances
of order $n^{-2}$ occur.

The distribution of $R$ behaves similarly in respect of its limiting normality to that
of $R^2$, though we shall see that its variance is always of order $1/n$.

One direct consequence of the singularity in the distribution of $R$ at $\mathbf{R}^2 = 0$ should
be mentioned. It follows from (27.88) that

$$\operatorname{var} R \sim (1-\mathbf{R}^2)^2/n, \tag{27.90}$$

which is the same as the asymptotic expression for the variance of the product-moment
correlation coefficient (cf. (26.24))

$$\operatorname{var} r \sim (1-\rho^2)^2/n.$$

It is natural to apply the variance-stabilizing z-transformation of 16.33 (cf. also Exercise
16.18) to $R$ also, obtaining a transformed variable $z = \operatorname{artanh} R$ with variance close
to $1/n$, independent of the value of $\mathbf{R}$. But this will not do near $\mathbf{R} = 0$, as Hotelling
(1953) pointed out, since (27.90) breaks down there; its asymptotic variance then
will be given by (27.84) as

$$\operatorname{var} R = E(R^2) - \{E(R)\}^2 \sim (p-1)/n, \tag{27.91}$$

as against the value $1/n$ obtained from (27.90). For $p = 2$ (when $R = |r|$), all is
well. Otherwise, we may only use the z-transformation of $R$ for values of $\mathbf{R}$ bounded
away from zero.


Unbiassed estimation of $\mathbf{R}^2$ in the multinormal case


27.34 Since, by (27.83), $R^2$ is a biassed estimator of $\mathbf{R}^2$, we may wish to adjust it
for the bias. Olkin and Pratt (1958) show that an unbiassed estimator of $\mathbf{R}^2_{1(2\ldots p)}$ is

$$t = 1 - \frac{n-3}{n-p}(1 - R^2_{1(2\ldots p)})\, F\{1, 1, \tfrac12(n-p+2), 1 - R^2_{1(2\ldots p)}\}, \tag{27.92}$$

where $n > p \geq 3$. $t$ is the unique unbiassed function of $R^2$ since it is a function of
the complete sufficient statistics. (27.92) may be expanded into series as

$$t = R^2 - \frac{p-3}{n-p}(1-R^2) - \left\{ \frac{2(n-3)}{(n-p)(n-p+2)}(1-R^2)^2 + O\!\left(\frac{1}{n^2}\right) \right\}, \tag{27.93}$$

whence it follows that $t \leq R^2$. If $R^2 = 1$, $t = 1$ also. When $R^2$ is zero or small, on
the other hand, $t$ is negative, as we might expect. We cannot find an unbiassed
estimator of $\mathbf{R}^2$ (i.e. an estimator whose expectation is $\mathbf{R}^2$ whatever the true value of $\mathbf{R}^2$)
which takes only non-negative values, even though we know that $\mathbf{R}^2$ is non-negative.
We may remove the absurdity of negative estimates by using as our estimator

$$t' = \max(t, 0) \tag{27.94}$$

but (27.94) is no longer unbiassed.
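
A minimal computational sketch of the estimator follows. It assumes that (27.92) has the form $t = 1 - \frac{n-3}{n-p}(1-R^2)F\{1, 1, \tfrac12(n-p+2), 1-R^2\}$ and sums the hypergeometric series directly, which is adequate when $n - p > 2$; the function names and trial values are illustrative.

```python
def hyp2f1_series(a, b, c, x, terms=2000):
    """Gauss hypergeometric series F(a, b, c, x), summed term by term."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= (a + k) * (b + k) / ((c + k) * (k + 1)) * x
    return total

def olkin_pratt(r2, n, p):
    """Estimator t of (27.92); p counts all variates, including x1."""
    x = 1.0 - r2
    return 1.0 - (n - 3) / (n - p) * x * hyp2f1_series(1, 1, 0.5 * (n - p + 2), x)

def olkin_pratt_truncated(r2, n, p):
    """t' = max(t, 0) of (27.94): never negative, but no longer unbiassed."""
    return max(olkin_pratt(r2, n, p), 0.0)

n, p = 30, 4                                  # illustrative values only
for r2 in (0.0, 0.1, 0.5, 1.0):
    print(r2, olkin_pratt(r2, n, p), olkin_pratt_truncated(r2, n, p))
```

For these values $t$ is negative at $R^2 = 0$, lies below $R^2$ at $R^2 = 0.5$, and equals 1 at $R^2 = 1$, in agreement with the remarks following (27.93).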


27.35 Lehmann (1959) shows that for testing $\mathbf{R}^2 = 0$ in the multinormal case, tests
rejecting large values of $R^2$ are UMP among test statistics which are invariant under
location and scale changes. Ezekiel and Fox (1959) and Kramer (1963) give charts
and tables for constructing confidence intervals for $\mathbf{R}^2$ from the value of $R^2$.

EXERCISES


27.1 Show that

$$\beta_{12.34\ldots(p-1)} = \frac{\beta_{12.34\ldots p} + \beta_{1p.23\ldots(p-1)}\,\beta_{p2.13\ldots(p-1)}}{1 - \beta_{1p.23\ldots(p-1)}\,\beta_{p1.23\ldots(p-1)}},$$

and that

$$\rho_{12.34\ldots(p-1)} = \frac{\rho_{12.34\ldots p} + \rho_{1p.23\ldots(p-1)}\,\rho_{2p.13\ldots(p-1)}}{\{(1-\rho^2_{1p.23\ldots(p-1)})(1-\rho^2_{2p.13\ldots(p-1)})\}^{1/2}}.$$

(Yule, 1907)


27.2 Show that for $p$ variates there are $\binom{p}{2}$ correlation coefficients of order zero
and $\binom{p-2}{s}\binom{p}{2}$ of order $s$. Show further that there are $\binom{p}{2}2^{p-2}$ correlation
coefficients altogether and $\binom{p}{2}2^{p-1}$ regression coefficients.


27.3 If the correlations of zero order among a set of variables are all equal to $\rho$,
show that every partial correlation of the $s$th order is equal to $\rho/(1+s\rho)$.


27.4 Prove equation (27.27), and show that it implies that the coefficient of $x_l x_m$ in
the exponent of the multinormal distribution of $x_1, x_2, \ldots, x_p$ is $1/\operatorname{cov}(x_{l.q_l}, x_{m.q_m})$.


27.5 Show from (27.46) that in summing the product of two residuals, any or all
of the secondary subscripts may be omitted from a residual all of whose secondary subscripts
are included among those of the other residual, i.e. that

$$\sum x_{1.stu}\, x_{2.st} = \sum x_{1.stu}\, x_{2.s} = \sum x_{1.stu}\, x_2,$$

but that

$$\sum x_{1.stu}\, x_{2.st} \neq \sum x_{1.s}\, x_{2.stu},$$

where $s$, $t$, $u$ are sets of subscripts.
(Chandler, 1950)


27.6 By the transformation

$$y_1 = x_1, \qquad y_2 = x_{2.1}, \qquad y_3 = x_{3.21}, \qquad \text{etc.}$$

show that the multivariate normal distribution may be written

$$\frac{1}{(2\pi)^{\frac12 p}\, \sigma_1 \sigma_{2.1} \sigma_{3.12} \cdots} \exp\left\{ -\frac12 \left( \frac{x_1^2}{\sigma_1^2} + \frac{x_{2.1}^2}{\sigma_{2.1}^2} + \frac{x_{3.12}^2}{\sigma_{3.12}^2} + \cdots \right) \right\}$$

so that the residuals $x_1, x_{2.1}, \ldots$ are independent of each other. Hence show that any
two residuals $x_{j.r}$ and $x_{k.r}$ (where $r$ is a set of common subscripts) are distributed in the
bivariate normal form with correlation $\rho_{jk.r}$.


27.7 Show that if an orthogonal transformation is applied to a set of $n$ independent
observations on $p$ multinormal variates, the transformed set of $n$ observations will also be
independent.


27.8 For the data of Tables 26.1 and 26.2, we saw in Example 26.6 and Exercise 26.1
that

$$r_{12} = 0.34, \qquad r_{13} = 0.07,$$

where subscripts 1, 2, 3 refer to Stature, Weight and Bust Girth respectively. Given
also that

$$r_{23} = 0.86,$$

show that

$$R^2_{3(12)} = 0.80,$$

indicating that Bust Girth is fairly well determined by a linear function of Stature and
Weight.
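
The arithmetic of Exercise 27.8 may be sketched as follows, assuming the three-variate identity $R^2_{3(12)} = (r^2_{13} + r^2_{23} - 2r_{12}r_{13}r_{23})/(1 - r^2_{12})$, which is $1 - |C|/C_{33}$ written out for $p = 3$. The function name is illustrative.

```python
import math

def multiple_r2_on_two(r_dep_a, r_dep_b, r_ab):
    """Squared multiple correlation of a dependent variable on two regressors
    a and b, from the three zero-order correlations."""
    return (r_dep_a**2 + r_dep_b**2 - 2 * r_ab * r_dep_a * r_dep_b) / (1 - r_ab**2)

r12, r13, r23 = 0.34, 0.07, 0.86                 # values quoted in the exercise
r2_3_12 = multiple_r2_on_two(r13, r23, r12)      # Bust Girth on Stature and Weight
print(round(r2_3_12, 2), round(math.sqrt(r2_3_12), 2))
```

With the quoted correlations the squared coefficient is about 0.80, so that the coefficient itself is about 0.89.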


27.9 Show directly that no linear function of $x_2, \ldots, x_p$ has a higher correlation
with $x_1$ than the Least Squares estimate of $x_1$.


27.10 Establish (27.83), the expression for $E(R^2)$.
(Wishart, 1931)


27.11 Establish (27.85), the expression for $\operatorname{var}(R^2)$.
(Wishart, 1931)


27.12 Verify that (27.92) is an unbiassed estimator of $\mathbf{R}^2$.

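27.13 Show from the non-central F-distribution of F at (27.73) when $\mathbf{R}^2 \neq 0$, that
the distribution of $R^2$ in this case, when $x_2, \ldots, x_p$ are fixed, is

$$dF = \frac{(R^2)^{\frac12(p-3)}(1-R^2)^{\frac12(n-p-2)}\, dR^2}{B\{\tfrac12(p-1), \tfrac12(n-p)\}} \cdot \exp\{-\tfrac12(n-p)\mathbf{R}^2\} \times \sum_{j=0}^{\infty} \frac{\Gamma\{\tfrac12(n-1+2j)\}\,\Gamma\{\tfrac12(p-1)\}}{\Gamma\{\tfrac12(n-1)\}\,\Gamma\{\tfrac12(p-1+2j)\}} \frac{\{\tfrac12(n-p)\mathbf{R}^2R^2\}^j}{j!}.$$

(Fisher, 1928a)


27.14 Show from (27.81) that for $n \to \infty$, $p$ fixed, the distribution of $nR^2 = B^2$ is

$$dF = \frac{(B^2)^{\frac12(p-3)} \exp(-\tfrac12\beta^2 - \tfrac12 B^2)}{2^{\frac12(p-1)}\,\Gamma\{\tfrac12(p-1)\}} \times \left\{ 1 + \frac{\beta^2 B^2}{(p-1)\cdot 2} + \frac{(\beta^2 B^2)^2}{(p-1)(p+1)\cdot 2\cdot 4} + \cdots \right\} d(B^2),$$

where $\beta^2 = n\mathbf{R}^2$, and hence that $nR^2$ is a non-central $\chi^2$ variate of form (24.18) with
$\nu = p-1$, $\lambda = n\mathbf{R}^2$. Show that the same result holds for the conditional distribution of
$nR^2$, from Exercise 27.13.

(Fisher, 1928a)


27.15 In Exercise 27.14, use the c.f. of a non-central $\chi^2$ variate given in Exercise 24.1
to show that as $n \to \infty$ for fixed $p$, $R^2$ is asymptotically normally distributed when
$\mathbf{R} \neq 0$, but not when $\mathbf{R} = 0$. Extend the result to $R$.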


27.16 Show that the distribution function of $R^2$ in multinormal samples may be
written, if $n-p$ is even, in the form

$$(1-\mathbf{R}^2)^{\frac12(n-1)} R^{p-1} \sum_{j=0}^{\frac12(n-p-2)} \frac{\Gamma\{\tfrac12(p-1+2j)\}}{\Gamma\{\tfrac12(p-1)\}\, j!} \frac{(1-R^2)^j}{(1-R^2\mathbf{R}^2)^{\frac12(n-1)+j}} \times F\{-j, -\tfrac12(n-p), \tfrac12(p-1), R^2\mathbf{R}^2\}.$$

(Fisher, 1928a)


27.17 Show that in a sample $(x_1, \ldots, x_n)$ of one observation from an $n$-variate multinormal
population with all means $\mu$, all variances $\sigma^2$ and all correlations equal to $\rho$, the
statistic

$$t^2 = \frac{n(n-1)(\bar{x} - \mu)^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \left\{ \frac{1-\rho}{1+(n-1)\rho} \right\}$$

has a “Student’s” $t^2$-distribution with $(n-1)$ degrees of freedom. When $\rho = 0$, this
reduces to the ordinary test of a mean of $n$ independent normal variates.

(Walsh, 1947)


27.18 If $x_0, x_1, \ldots, x_n$ are normal variates with common variance $\sigma^2$, $x_1, \ldots, x_n$
being independent of each other and $x_0$ having zero mean and correlation $\lambda$ with each
of the others, show that the $n$ variates

$$y_i = x_i - a x_0, \qquad i = 1, 2, \ldots, n,$$

are multinormally distributed with all correlations equal to

$$\rho = (a^2 - 2a\lambda)/(1 + a^2 - 2a\lambda)$$

and all variances equal to

$$\sigma^2/(1-\rho).$$

(Stuart, 1958)


27.19 Use the result of Exercise 27.18 to establish that of Exercise 27.17.
(Stuart, 1958)


27.20 Show that if every pair from $x_2, \ldots, x_p$ are uncorrelated,

$$\mathbf{R}^2_{1(2\ldots p)} = \sum_{j=2}^{p} \rho^2_{1j} = \sum_{j=2}^{p} (\beta_{1j}\sigma_j)^2/\sigma_1^2.$$


27.21 Generalizing (27.17), show in the matrix notation given at the end of 27.6
that the conditional mean of the vector $(x_1, \ldots, x_k)'$, when the vector $(x_{k+1}, \ldots, x_p)'$
is fixed at $\mathbf{x}_p$, is $\mathbf{B}'\mathbf{E}^{-1}\mathbf{x}_p$.

(Marsaglia (1964) shows that this result and (27.14) hold even for
singular multinormal distributions if $\mathbf{E}^{-1}$ is replaced by the pseudo-inverse
$\mathbf{E}^{+} = \mathbf{T}'(\mathbf{T}\mathbf{T}')^{-2}\mathbf{T}$, where $\mathbf{E} = \mathbf{T}'\mathbf{T}$.)


CHAPTER 28 


THE GENERAL THEORY OF REGRESSION 


28.1 In the last two chapters we have developed the theory of linear regression 
of one variable upon one or more others, but our main preoccupation there was with 
the theory of correlation. We now, so to speak, bring the theory of regression to the 
centre of the stage. In this chapter we shall generalize and draw together the results 
of Chapters 26 and 27, and we shall also make use of the theory of Least Squares 
developed in Chapter 19. 

When discussing the regression of y upon one or more variables x, it has been 
customary to call y a “dependent” variable and x the “ independent ” variables. 
This usage, taken over from ordinary algebra, is a bad one, for the x-variables are not 
in general independent of each other in the probability sense ; indeed, we shall see 
that they need not be random variables at all. Further, since the whole purpose of a 
regression analysis is to investigate the dependence of y upon x, it is particularly confusing
to call the x-variables “independent.” Notwithstanding common usage,
therefore, we shall follow some more recent writers, e.g. Hannan (1956), and call x the 
regressor variables (or regressors, for short). 

We first consider the extension of the analytical theory of regression from the linear 
situations discussed in Chapters 26 and 27. The distinguishing feature of the analytical 
theory is that knowledge of the joint distribution of the variables, or equivalently of 
their joint characteristic function, is assumed. 


The analytical theory of regression 
28.2 Let $f(x, y)$ be the joint frequency function of the variables $x$, $y$. Then, for
any fixed value of $x$, say $X$, the mean value of $y$ is defined by

$$E(y \mid X) = \int_{-\infty}^{\infty} y f(X, y)\, dy \bigg/ \int_{-\infty}^{\infty} f(X, y)\, dy. \tag{28.1}$$

(28.1) is the regression (curve) discussed in 26.5; it gives the relation between $X$ and
the mean value of $y$ for that value of $X$, which is a mathematical relationship, not a
probabilistic one.

We may also consider the more general regression (curve) of order $r$, defined by

$$\mu'_{rX} = E(y^r \mid X) = \int_{-\infty}^{\infty} y^r f(X, y)\, dy \bigg/ \int_{-\infty}^{\infty} f(X, y)\, dy, \tag{28.2}$$

which expresses the dependence of the $r$th moment of $y$, for fixed $X$, upon $X$. Similarly

$$\mu_{rX} = E[\{y - E(y \mid X)\}^r \mid X] = \int_{-\infty}^{\infty} \{y - E(y \mid X)\}^r f(X, y)\, dy \bigg/ \int_{-\infty}^{\infty} f(X, y)\, dy \tag{28.3}$$

gives the dependence of the central moments of $y$, for fixed $X$, upon $X$.

If $r = 2$ in (28.3), it is called the scedastic curve, giving the dependence of the
variance of $y$ for fixed $X$ upon $X$. If the skewness coefficient $\beta_{1X} = \mu^2_{3X}/\mu^3_{2X}$ is plotted
against $X$, we obtain the clitic curve, and if $\beta_{2X} = \mu_{4X}/\mu^2_{2X}$ is plotted, we have the kurtic curve.(*)
These are not, in fact, in common use. The regression curve of outstanding importance
is that for $r = 1$, which is (28.1); so much so, that whenever “regression” is mentioned
without qualification, the regression of the mean, (28.1), is to be understood.

As we saw in 26.5, we are sometimes interested in the regression of $x$ upon $y$ as
well as that of $y$ upon $x$. We then have the obvious analogues of (28.2) and (28.3),
and in particular that of (28.1),

$$E(x \mid Y) = \int_{-\infty}^{\infty} x f(x, Y)\, dx \bigg/ \int_{-\infty}^{\infty} f(x, Y)\, dx. \tag{28.4}$$
28.3 Just as we can obtain the moments from a c.f. without explicitly evaluating
the frequency function, so we can find the regression of any order from the joint c.f.
of $x$ and $y$ without explicitly determining their joint f.f., $f(x, y)$. Write

$$f(x, y) = g(x)\, h_x(y), \tag{28.5}$$

where $g(x)$ is the marginal distribution of $x$ and $h_x(y)$ the conditional distribution of
$y$ for given $x$.(†) The joint c.f. of $x$ and $y$ is

$$\phi(t_1, t_2) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \exp(it_1x + it_2y)\, g(x)\, h_x(y)\, dx\, dy \tag{28.6}$$

$$= \int_{-\infty}^{\infty} \exp(it_1x)\, g(x)\, \phi_x(t_2)\, dx, \tag{28.7}$$

where

$$\phi_x(t_2) = \int_{-\infty}^{\infty} \exp(it_2y)\, h_x(y)\, dy$$

is the conditional c.f. of $y$ for given $x$. If the $r$th moment of $y$ for given $x$ is $\mu'_{rx}$, as
in 28.2, we have

$$i^r \mu'_{rx} = \left[ \frac{\partial^r}{\partial t_2^r} \phi_x(t_2) \right]_{t_2=0} \tag{28.8}$$

and hence, from (28.7) and (28.8),

$$\left[ \frac{\partial^r}{\partial t_2^r} \phi(t_1, t_2) \right]_{t_2=0} = i^r \int_{-\infty}^{\infty} \exp(it_1x)\, g(x)\, \mu'_{rx}\, dx. \tag{28.9}$$

Hence, by the Inversion Theorem (4.3),

$$g(x)\, \mu'_{rx} = \frac{1}{2\pi i^r} \int_{-\infty}^{\infty} \exp(-it_1x) \left[ \frac{\partial^r}{\partial t_2^r} \phi(t_1, t_2) \right]_{t_2=0} dt_1. \tag{28.10}$$

(28.10) is the required expression, from which the regression of any order may be
written down.

(*) Although, so far as we know, such a thing has never been done, it might be more advantageous
to plot the cumulants of $y$, rather than its moments, against $X$.
(†) We now no longer use $X$ for the fixed value of $x$.

From (28.10) with $r = 1$, we have


(s)uie = 5 | ss Set ita) | b (ty | i, (28.11) 


28.4 If all cumulants exist, we have the definition of bivariate cumulants at (3.74) 


ree exp{ 3 rs ae ee, 


s| 
where ko9 is defined to be equal to zero. Hence 


la aed iad eee cee =e 


r! 


= i$(t0). = fit (28.12) 
In virtue of (28.12), (28.11) becomes 
ae 3 2 (it, 
gle = a5) exp(—itha) $40) Bead, (28.13) 


and if the interchange of integration and summation operations is permissible, (28.13) 
becomes 


2(x) Mie = a = = 7: ti exp (—7t,x)4(t,,0) dey. (28.14) 
Since, by the Inversion Theorem, 
g(x) = a |. exP(—itsa) $4, 0) dey 
we have, subject to existence conditions, 
(-Dyg() =(-19 S580) = 5 | enp(—itsa)$(4,0)dt, (28.15) 
Using (28.15), (28.14) becomes 


g(*)Mie = = 7 (—Dya(s). (28.16) 
Thus, for the regression of the mean of y on x, we have 
= Kri(—D)'g (x) 
pee 8 > Kra( 28.17 
S r=0 r! g (x) ( 


a result due to Wicksell (1934). (28.17) is valid if cumulants of all orders exist and if 
the interchange of integration and summation in (28.13) is legitimate ; this will be 
so, in particular, if g(x) and all its derivatives are continuous within the range of x and 
zero at its extremes. 

If g(x) is normal and standardized, we have the particular case of (28.17) 


ie EH (x), (28.18) 
where $H_r(x)$ is the Tchebycheff-Hermite polynomial of order r, defined at (6.21).

Example 28.1
For the bivariate normal distribution

    $f(x,y) = (2\pi\sigma_1\sigma_2)^{-1}(1-\rho^2)^{-1/2} \exp\left[ -\frac{1}{2(1-\rho^2)} \left\{ \left(\frac{x-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x-\mu_1}{\sigma_1}\right)\left(\frac{y-\mu_2}{\sigma_2}\right) + \left(\frac{y-\mu_2}{\sigma_2}\right)^2 \right\} \right],$

the joint c.f. of $(x-\mu_1)/\sigma_1$ and $(y-\mu_2)/\sigma_2$ is (cf. Example 15.1)

    $\phi(t_1,t_2) = \exp\{-\tfrac12(t_1^2+t_2^2+2\rho t_1t_2)\},$

whence

    $\kappa_{01} = 0,$
    $\kappa_{r1} = 0, \quad r > 1;$

so that

    $\kappa_{11} = \rho$

is the only non-zero cumulant in (28.17). The marginal distribution g(x) is standard
normal, so that (28.17) becomes (28.18) and we have, using (6.23),

    $\mu'_{1x} = \kappa_{11} H_1(x) = \rho x.$

This is the regression of $(y-\mu_2)/\sigma_2$ on $(x-\mu_1)/\sigma_1$. If we now de-standardize, we
find for the regression of y on x,

    $E(y \mid x) - \mu_2 = \rho\,\frac{\sigma_2}{\sigma_1}\,(x-\mu_1),$

a more general form of the first equation in (16.46), which has x and y interchanged
and $\mu_1 = \mu_2 = 0$.


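Example 28.2

In a sample of n observations from the bivariate normal distribution of the previous
example, consider the joint distribution of

    $u = \tfrac12 \sum_{i=1}^{n} (x_i-\mu_1)^2/\sigma_1^2$  and  $v = \tfrac12 \sum_{i=1}^{n} (y_i-\mu_2)^2/\sigma_2^2.$

The joint c.f. of u and v is easily found from Example 26.1 to be

    $\phi(t_1,t_2) = \{(1-\theta_1)(1-\theta_2) - \rho^2\theta_1\theta_2\}^{-n/2},$    (28.19)

where $\theta_1 = it_1$, $\theta_2 = it_2$. The joint f.f. of u and v cannot be expressed in a simple
form, but we may determine the regressions without it. From (28.19),

    $\left[\frac{\partial^r}{\partial t_2^r}\,\phi(t_1,t_2)\right]_{t_2=0} = i^r\,(\tfrac12 n+r-1)^{(r)}\,\frac{\{1-(1-\rho^2)\theta_1\}^r}{(1-\theta_1)^{n/2+r}},$    (28.20)

where $(\tfrac12 n+r-1)^{(r)} = (\tfrac12 n+r-1)(\tfrac12 n+r-2)\cdots\tfrac12 n$. Thus, from (28.10) and (28.20),

    $g(u)\,\mu'_{ru} = (\tfrac12 n+r-1)^{(r)}\,\frac{1}{2\pi} \int_{-\infty}^{\infty} \exp(-\theta_1u)\,\frac{\{\rho^2+(1-\rho^2)(1-\theta_1)\}^r}{(1-\theta_1)^{n/2+r}}\,dt_1.$    (28.21)

Now, from the inversion of the c.f. in Example 4.4,

    $\frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{\exp(-\theta_1u)}{(1-\theta_1)^{k}}\,dt_1 = \frac{e^{-u}u^{k-1}}{\Gamma(k)},$

while the marginal distribution g(u) is (cf. (11.8))

    $g(u) = \frac{e^{-u}u^{n/2-1}}{\Gamma(\tfrac12 n)}.$

Substituting into (28.21), we find, putting r = 1, 2 successively,

    $\mu'_{1u} = \tfrac12 n \left\{ \frac{\rho^2u}{\tfrac12 n} + (1-\rho^2) \right\} = \rho^2u + \tfrac12 n(1-\rho^2)$    (28.22)

and

    $\mu'_{2u} = \tfrac12 n(\tfrac12 n+1) \left\{ \frac{\rho^4u^2}{\tfrac12 n(\tfrac12 n+1)} + \frac{2\rho^2(1-\rho^2)u}{\tfrac12 n} + (1-\rho^2)^2 \right\}$
    $\phantom{\mu'_{2u}} = \rho^4u^2 + 2\rho^2(1-\rho^2)\,u\,(\tfrac12 n+1) + (1-\rho^2)^2\,\tfrac12 n(\tfrac12 n+1),$

so that

    $\mu_{2u} = \mu'_{2u} - (\mu'_{1u})^2 = (1-\rho^2)\{2\rho^2u + \tfrac12 n(1-\rho^2)\}.$    (28.23)

(28.22) and (28.23) indicate that the regressions upon u of both the mean and variance
of v are linear; since (28.19) is symmetric in $\theta_1$ and $\theta_2$, the same holds for the
regressions upon v of the mean and variance of u.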
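
The algebra leading to (28.22) and (28.23) can be checked exactly. The sketch below is an illustration, not part of the original derivation: it assumes that the bracket in (28.21) is expanded by the binomial theorem and inverted term by term with the Example 4.4 formula, which gives the conditional moment as a finite sum. The function names `rising` and `conditional_moment` are hypothetical, and rational arithmetic is used so that the comparison is exact.

```python
from fractions import Fraction
from math import comb


def rising(a, k):
    """Return a (a+1) ... (a+k-1), with the empty product equal to 1."""
    out = Fraction(1)
    for j in range(k):
        out *= a + j
    return out


def conditional_moment(r, u, n, rho2):
    """r-th moment about zero of v given u, from (28.21) inverted term by term.

    rho2 stands for the squared correlation; u and rho2 should be Fractions.
    """
    half_n = Fraction(n, 2)
    total = Fraction(0)
    for k in range(r + 1):
        # Binomial term rho^(2k) (1 - rho^2)^(r-k), inverted and divided by g(u).
        total += comb(r, k) * rho2**k * (1 - rho2) ** (r - k) * u**k / rising(half_n, k)
    return rising(half_n, r) * total


n, rho2, u = 7, Fraction(1, 4), Fraction(2)
mean = conditional_moment(1, u, n, rho2)
variance = conditional_moment(2, u, n, rho2) - mean**2
print(mean, variance)
```

Both printed values agree with the right-hand sides of (28.22) and (28.23) for the chosen n, squared correlation and u.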


Criteria for linearity of regression 
28.5 Let $\psi(t_1,t_2) = \log\phi(t_1,t_2)$ be the joint c.g.f. of x and y. We now prove:
if the regression of y upon x is linear, so that

    $\mu'_{1x} = E(y \mid x) = \beta_0+\beta_1x,$    (28.24)

then

    $\left[\frac{\partial\psi(t_1,t_2)}{\partial t_2}\right]_{t_2=0} = i\beta_0 + \beta_1\,\frac{\partial\psi(t_1,0)}{\partial t_1};$    (28.25)

and conversely, if a completeness condition holds, (28.25) is sufficient as well as
necessary for (28.24).
From (28.9) with r = 1, we have, using (28.24),

    $\left[\frac{\partial\phi(t_1,t_2)}{\partial t_2}\right]_{t_2=0} = i \int_{-\infty}^{\infty} \exp(it_1x)\,g(x)\,(\beta_0+\beta_1x)\,dx$    (28.26)

    $\phantom{\left[\frac{\partial\phi}{\partial t_2}\right]} = i\beta_0\,\phi(t_1,0) + \beta_1\,\frac{\partial}{\partial t_1}\,\phi(t_1,0).$    (28.27)

Putting $\psi = \log\phi$ in (28.27), and dividing through by $\phi(t_1,0)$, we obtain (28.25).
Conversely, if (28.25) holds, we rewrite it, using (28.9), in the form

    $i \int_{-\infty}^{\infty} \exp(it_1x)\,(\beta_0+\beta_1x-\mu'_{1x})\,g(x)\,dx = 0.$    (28.28)

We now see that (28.28) implies

    $\beta_0+\beta_1x-\mu'_{1x} = 0$    (28.29)

identically in x if $\exp(it_1x)\,g(x)$ is complete, and hence (28.24) follows.

28.6 If all cumulants exist, (28.25) gives, on using (28.12),

    $\sum_{r=0}^{\infty} \kappa_{r1}\,\frac{(it_1)^r}{r!} = \beta_0 + \beta_1 \sum_{r=1}^{\infty} \kappa_{r0}\,\frac{(it_1)^{r-1}}{(r-1)!}.$    (28.30)

Identifying coefficients of $t_1^r$ in (28.30) gives

    (r = 0)    $\kappa_{01} = \beta_0+\beta_1\kappa_{10},$    (28.31)

as is obvious from (28.24);

    (r ≥ 1)    $\kappa_{r1} = \beta_1\kappa_{r+1,0}.$    (28.32)

The condition (28.32) for linearity of regression is also due to Wicksell (1934). (28.31)
and (28.32) together are sufficient, as well as necessary, for (28.25) and thence (given
the completeness of g(x), as before) for the linearity condition (28.24).

If we express (28.25) in terms of the c.f. $\phi$, instead of its logarithm $\psi$, as in (28.27),
and carry through the process leading to (28.32), we find the analogue of (28.32) for
the central moments,

    $\mu_{r1} = \beta_1\mu_{r+1,0}.$    (28.33)

If the regression of x on y is also linear, of form

    $E(x \mid y) = \beta'_0+\beta'_1y,$

we shall also have

    $\kappa_{1r} = \beta'_1\kappa_{0,r+1}, \quad r \ge 1.$    (28.34)

When r = 1, (28.32) and (28.34) give

    $\kappa_{11} = \beta_1\kappa_{20} = \beta'_1\kappa_{02},$

whence

    $\beta_1\beta'_1 = \kappa_{11}^2/(\kappa_{20}\kappa_{02}) = \rho^2,$    (28.35)

which is (26.17) again, $\rho$ being the correlation coefficient between x and y.


28.7 We now impose a further restriction on our variables: we suppose that the
conditional distribution of y about its mean value (which, as before, is a function of
the fixed value of x) is the same for any x, i.e. that only the mean of y changes with x.
We shall refer to this restriction by saying that y “has identical errors.” There is
thus a variate $\epsilon$ such that

    $y = \mu'_{1x}+\epsilon.$    (28.36)

In particular, if the regression is linear (28.36) is

    $y = \beta_0+\beta_1x+\epsilon.$    (28.37)

If y has identical errors, (28.5) becomes

    $f(x,y) = g(x)\,h(\epsilon),$    (28.38)

where h is now the conditional distribution of $\epsilon$. Conversely, (28.38) implies identical
errors for y.

The corresponding result for c.f.s is not quite so obvious: if the regression of
y on x is linear with identical errors, then the joint c.f. of x and y factorizes into

    $\phi(t_1,t_2) = \phi_g(t_1+t_2\beta_1)\,\phi_h(t_2)\,\exp(it_2\beta_0),$    (28.39)

the suffixes to $\phi$ denoting the corresponding f.f.s. To prove (28.39), we note that

    $\phi(t_1,t_2) = \int\!\!\int \exp(it_1x+it_2y)\,f(x,y)\,dx\,dy$

    $\phantom{\phi(t_1,t_2)} = \int\!\!\int \exp\{it_1x+it_2(\beta_0+\beta_1x+\epsilon)\}\,g(x)\,h(\epsilon)\,dx\,d\epsilon$

    $\phantom{\phi(t_1,t_2)} = \int \exp\{i(t_1+t_2\beta_1)x\}\,g(x)\,dx \int \exp(it_2\epsilon)\,h(\epsilon)\,d\epsilon \cdot \exp(it_2\beta_0)$    (28.40)

and (28.39) is simply (28.40) rewritten. Note that if $\beta_1 = 0$, (28.39) shows that x and
y are independent: linearity of regression, identical errors and a zero regression
coefficient imply independence, as is intuitively obvious.


A characterization of the bivariate normal distribution 

28.8 We may now prove a remarkable result: if the regressions of y on x and of 
x on y are both linear with identical errors, then x and y are distributed in the bivariate 
normal form unless (a) they are independent of each other, or (b) they are functionally 
related. 

Given the assumptions of the theorem, we have at once, taking logarithms in (28.39),

    $\psi(t_1,t_2) = \psi_g(t_1+t_2\beta_1) + \psi_h(t_2) + it_2\beta_0,$    (28.41)

and similarly, from the regression of x on y,

    $\psi(t_1,t_2) = \psi_{g'}(t_2+t_1\beta'_1) + \psi_{h'}(t_1) + it_1\beta'_0,$    (28.42)

where primes are used to distinguish the coefficients and distributions from those in
(28.41). Equating (28.41) and (28.42), and considering successive powers of $t_1$ and $t_2$,
we find, denoting the rth cumulant of g by $\kappa_{r0}$, that of g' by $\kappa_{0r}$, that of h by $\lambda_{r0}$ and
that of h' by $\lambda_{0r}$:

First power:

    $\kappa_{10}\,i(t_1+t_2\beta_1) + \lambda_{10}\,it_2 + it_2\beta_0 = \kappa_{01}\,i(t_2+t_1\beta'_1) + \lambda_{01}\,it_1 + it_1\beta'_0,$

or, equating coefficients of $t_1$ and of $t_2$,

    $\kappa_{10} = \kappa_{01}\beta'_1 + \lambda_{01} + \beta'_0,$    (28.43)
    $\kappa_{10}\beta_1 + \lambda_{10} + \beta_0 = \kappa_{01}.$    (28.44)

In point of fact, we may quite generally assume that the errors have zero means, for,
if not, the means could be absorbed into $\beta_0$ or $\beta'_0$. If we also measure x and y from
their means, (28.43) and (28.44) give

    $\beta_0 = \beta'_0 = 0,$    (28.45)

as is obvious from general considerations.


Second power:

    $\kappa_{20}(t_1+t_2\beta_1)^2 + \lambda_{20}t_2^2 = \kappa_{02}(t_2+t_1\beta'_1)^2 + \lambda_{02}t_1^2,$

which, on equating coefficients of $t_1^2$, $t_1t_2$ and $t_2^2$, gives

    $\kappa_{20} = \kappa_{02}(\beta'_1)^2 + \lambda_{02},$    (28.46)
    $\kappa_{20}\beta_1 = \kappa_{02}\beta'_1,$    (28.47)
    $\kappa_{20}\beta_1^2 + \lambda_{20} = \kappa_{02}.$    (28.48)

(28.46-8) give relations between g, h, g' and h'; in particular, (28.47) gives the ratio
$\beta_1/\beta'_1$ as equal to $\kappa_{02}/\kappa_{20}$, the ratio of parent variances.

Third power:

    $\kappa_{30}\{i(t_1+t_2\beta_1)\}^3 + \lambda_{30}(it_2)^3 = \kappa_{03}\{i(t_2+t_1\beta'_1)\}^3 + \lambda_{03}(it_1)^3.$

The terms in $t_1^2t_2$ and $t_1t_2^2$ give us

    $\kappa_{30}\beta_1 = \kappa_{03}(\beta'_1)^2,$    (28.49)
    $\kappa_{30}\beta_1^2 = \kappa_{03}\beta'_1.$    (28.50)

Leaving aside temporarily the possibilities that $\beta_1$, $\beta'_1 = 0$ or $\beta_1\beta'_1 = 1$, we see that
otherwise (28.49) and (28.50) imply $\kappa_{30} = \kappa_{03} = 0$. Similarly, if we take the fourth
and higher powers, we find that all the higher cumulants $\kappa_{r0}$, $\kappa_{0r}$ must vanish. Then
it follows from equations such as those obtained from the terms in $t_1^3$, $t_2^3$ in the
third-power equation, namely

    $\kappa_{30} = \kappa_{03}(\beta'_1)^3 + \lambda_{03},$
    $\kappa_{30}\beta_1^3 + \lambda_{30} = \kappa_{03},$

that the cumulants after the second of h, h' also must vanish. Thus all the distributions
g, h, g', h' are normal and from (28.41) or (28.42) it follows that $\psi(t_1,t_2)$ is a quadratic
in $t_1$, $t_2$, and hence that x, y are bivariate normally distributed.

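In the exceptional cases we have neglected, this is no longer true. If $\beta_1\beta'_1 = 1$,
the correlation between x and y is ±1 by (28.35) and x is a strict linear function of y
(cf. 26.9); if, on the other hand, $\beta_1$ or $\beta'_1 = 0$, the variables x, y are independent, as
remarked at the end of 28.7. This completes the proof of the theorem.(*)

(*) The first result of this kind appears to be due to Bernstein (1928). For a proof under
general conditions see Féron and Fourgeaud (1952).

Multivariate generalizations
28.9 We now briefly indicate the extension of our results to the case of p regressors
$x_1, x_2, \ldots, x_p$. The linear regression is then

    $E(y \mid x_1,\ldots,x_p) = \beta_0+\beta_1x_1+\ldots+\beta_px_p.$    (28.51)

Writing the joint f.f.

    $f(y,x_1,\ldots,x_p) = g(\mathbf{x})\,h_{\mathbf{x}}(y)$

as at (28.5), where $g(\mathbf{x})$ is the p-variate marginal distribution of $x_1,\ldots,x_p$, we find as
at (28.6)

    $\phi(u,t_1,\ldots,t_p) = \int\!\cdots\!\int \exp\Big(iuy+i\sum_{j=1}^{p}t_jx_j\Big)\,g(\mathbf{x})\,h_{\mathbf{x}}(y)\,d\mathbf{x}\,dy$

    $\phantom{\phi(u,t_1,\ldots,t_p)} = \int\!\cdots\!\int \exp\Big(i\sum_{j=1}^{p}t_jx_j\Big)\,g(\mathbf{x})\,\phi_{\mathbf{x}}(u)\,d\mathbf{x}$    (28.52)

as at (28.7). Just as at (28.8),

    $i^r\mu'_{r\mathbf{x}} = \left[\frac{\partial^r}{\partial u^r}\,\phi_{\mathbf{x}}(u)\right]_{u=0}$

and as at (28.9)

    $\left[\frac{\partial^r}{\partial u^r}\,\phi(u,\mathbf{t})\right]_{u=0} = i^r \int\!\cdots\!\int \exp\Big(i\sum_{j=1}^{p}t_jx_j\Big)\,g(\mathbf{x})\,\mu'_{r\mathbf{x}}\,d\mathbf{x},$    (28.53)

giving the generalization of (28.10)

    $g(\mathbf{x})\,\mu'_{r\mathbf{x}} = \frac{(-i)^r}{(2\pi)^p} \int\!\cdots\!\int \exp\Big(-i\sum_{j=1}^{p}t_jx_j\Big) \left[\frac{\partial^r}{\partial u^r}\,\phi(u,\mathbf{t})\right]_{u=0} d\mathbf{t}.$    (28.54)
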

28.10 The reader should have no difficulty in extending the criterion of 28.5 for
linearity of regression: if (28.51) is to hold, we must have

    $\left[\frac{\partial\psi(u,t_1,\ldots,t_p)}{\partial u}\right]_{u=0} = i\beta_0 + \sum_{j=1}^{p} \beta_j\,\frac{\partial}{\partial t_j}\,\psi(0,t_1,\ldots,t_p),$    (28.55)

generalizing (28.25). Similarly, the extension of the criterion of (28.32) is

    $\kappa_{1,r_1,r_2,\ldots,r_p} = \beta_1\kappa_{0,r_1+1,r_2,\ldots,r_p} + \beta_2\kappa_{0,r_1,r_2+1,\ldots,r_p} + \ldots + \beta_p\kappa_{0,r_1,\ldots,r_{p-1},r_p+1}.$    (28.56)

The condition (28.38) for identical errors generalizes to

    $f(y,x_1,\ldots,x_p) = g(\mathbf{x})\,h(\epsilon)$

and (28.39) generalizes to

    $\phi(u,t_1,\ldots,t_p) = \phi_g(t_1+u\beta_1,\,t_2+u\beta_2,\,\ldots,\,t_p+u\beta_p)\,\phi_h(u)\,\exp(iu\beta_0).$    (28.57)

Finally, generalizing 28.8, if each of the linear regressions of a set of p variables has
identical errors, the variables are multinormally distributed unless they are mutually
completely independent or they are functionally related.


28.11 If the regression of y on x is a polynomial, of type

    $E(y \mid x) = \beta_0+\beta_1x+\beta_2x^2+\ldots+\beta_px^p,$    (28.58)

we may obtain similar results. However, as we shall see later in this chapter, this is
best treated as a particular case of the p-regressor situation where the regressors are
functionally related, so that any results we require for the polynomial regression
situation may be obtained by specializing the results of 28.9-10. For example, a condition
that (28.58) holds is

    $\left[\frac{\partial\phi(u,t)}{\partial u}\right]_{u=0} = i\left\{ \beta_0\,\phi(0,t) + \beta_1\,\frac{\partial\phi(0,t)}{\partial(it)} + \beta_2\,\frac{\partial^2\phi(0,t)}{\partial(it)^2} + \ldots + \beta_p\,\frac{\partial^p\phi(0,t)}{\partial(it)^p} \right\},$

which, on division by $\phi(0,t)$, reduces to (28.25) (in a slightly different notation) when
p = 1, and is easily obtained as a special case of (28.55) by noting that the regressor $x^r$
enters the joint c.f. through a factor $\exp(it_rx^r)$, so that differentiating with respect to
$t_r$ and then putting every $t_j$ other than $t_1 = t$ equal to zero gives

    $E\{ix^r\exp(itx)\} = i\,\frac{\partial^r}{\partial(it)^r}\,E\{\exp(itx)\}.$


The general linear regression model 

28.12 The analytical theory of regression, which we have so far discussed, is of
interest in statistical theory but not in the practice of experimental statistics, precisely
because it requires a detailed knowledge of the form of the underlying distribution.
We now turn to the discussion of the general linear regression model, which is
extensively used in practice because of the simplified (but nevertheless reasonably realistic)
assumptions which it embodies. This is, in fact, simply the general linear model of
19.4, with the parameters $\boldsymbol\theta$ as regression coefficients. (19.8) is thus rewritten

    $\mathbf{y} = \mathbf{X}\boldsymbol\beta+\boldsymbol\epsilon,$    (28.59)

where $\boldsymbol\beta$ is a (k × 1) vector of regression coefficients, $\mathbf{X}$ is an (n × k) matrix of known
coefficients (not random variables), and $\boldsymbol\epsilon$ an (n × 1) vector of “error” random variables
(not necessarily normally distributed) with means and dispersion matrix

    $E(\boldsymbol\epsilon) = \mathbf{0}, \qquad V(\boldsymbol\epsilon) = \sigma^2\mathbf{I}.$    (28.60)

We assume n > k and $|\mathbf{X}'\mathbf{X}| \ne 0$.

All the results of Chapter 19 now apply. From (19.12),

    $\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$    (28.61)

is the vector of LS estimators of $\boldsymbol\beta$; from (19.16), its dispersion matrix is

    $V(\hat{\boldsymbol\beta}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$    (28.62)

and from 19.6 it is the MV unbiassed linear estimator of $\boldsymbol\beta$. Finally, from (19.41), an
unbiassed estimator of $\sigma^2$ is $s^2$, where

    $(n-k)s^2 = (\mathbf{y}-\mathbf{X}\hat{\boldsymbol\beta})'(\mathbf{y}-\mathbf{X}\hat{\boldsymbol\beta}) = \mathbf{y}'\mathbf{y}-\hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y}.$    (28.63)

$s^2$ is the sum of squared residuals divided by the number of observations minus the
number of parameters estimated.
We have already applied this model to regression situations in 26.8 and 27.15.
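
The estimators (28.61) to (28.63) can be sketched in a few lines. The following is a minimal illustration, not a recommended numerical method: it inverts X'X by Gauss-Jordan elimination in exact rational arithmetic, and the helper names (`transpose`, `matmul`, `inverse`, `least_squares`) and the four-point data set are assumptions made for the example.

```python
from fractions import Fraction


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def inverse(A):
    """Gauss-Jordan inverse of a square non-singular matrix, in Fractions."""
    k = len(A)
    M = [list(map(Fraction, row)) + [Fraction(int(i == j)) for j in range(k)]
         for i, row in enumerate(A)]
    for c in range(k):
        p = next(r for r in range(c, k) if M[r][c] != 0)  # first usable pivot row
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(k):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[k:] for row in M]


def least_squares(X, y):
    """Return (beta_hat, (X'X)^-1, s2) following (28.61)-(28.63)."""
    n, k = len(X), len(X[0])
    Xt = transpose(X)
    XtX_inv = inverse(matmul(Xt, X))
    ycol = [[Fraction(v)] for v in y]
    beta = [b[0] for b in matmul(XtX_inv, matmul(Xt, ycol))]
    resid = [Fraction(v) - sum(x * b for x, b in zip(row, beta)) for row, v in zip(X, y)]
    s2 = sum(e * e for e in resid) / (n - k)
    return beta, XtX_inv, s2


# Hypothetical data: a constant column and one regressor taking the values 0, 1, 2, 3.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [1, 3, 4, 7]
beta_hat, XtX_inv, s2 = least_squares(X, y)
print(beta_hat, s2)
```

The estimated dispersion matrix of the coefficients is then `s2` times `XtX_inv`, as in (28.62) with the estimate substituted for the unknown variance.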


The meaning of “ linear ” 

28.13 Before proceeding further, it is as well to emphasize the meaning of the
adjective “linear” in the general regression model (28.59): it is linear in the
parameters $\beta_i$, not necessarily in the x's. In fact, as we have remarked, the elements of X
can be any set of known constants, related to each other in any desired manner. Up to
28.11, on the other hand, we understood by “linear regression” that the conditional
mean value of y is a linear function of the regressors $x_1,\ldots,x_p$. From the point of
view of our present (Least Squares) analysis, the latter (perhaps more “natural”)
definition of linearity is irrelevant; it is linearity in the parameters which is essential.
Thus the linear regression model includes all manner of “polynomial” or
“curvilinear” forms of dependence of y upon $x_1,\ldots,x_p$. For example, the straightforward
polynomial relationship

    $y_j = \beta_0+\beta_1x_j+\beta_2x_j^2+\ldots+\beta_kx_j^k+\epsilon_j, \quad j = 1,2,\ldots,n,$    (28.64)

is linear in the $\beta$'s, and thus is a special case of (28.59) (cf. the remarks in 28.11).
Similarly, the “multiple curvilinear” case

    $y_j = \beta_0+\beta_1x_{1j}+\beta_2x_{1j}^2+\beta_3x_{2j}+\beta_4x_{2j}^2+\beta_5x_{1j}x_{2j}+\epsilon_j, \quad j = 1,2,\ldots,n,$    (28.65)
is a linear regression model. However,

    $y_j = \beta_0+\beta_1x_{1j}+\beta_2x_{2j}+\beta_1^2x_{3j}+\epsilon_j, \quad j = 1,2,\ldots,n,$

is not, since $\beta_1$ and $\beta_1^2$ both appear.
Other functions than polynomials may also appear in a linear model. Thus

    $y_j = \beta_0+\beta_1x_{1j}+\beta_2x_{2j}+\beta_3\sin x_{1j}\cos x_{2j}+\epsilon_j$

is a linear model.
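
The distinction drawn in 28.13 can be illustrated with a short sketch. It assumes that the regressors of the “multiple curvilinear” model are a constant, the two variables, their squares and their product; the function names and the numerical coefficient vectors are hypothetical. The mean response is curved in the x's, yet it is additive in the coefficient vector, which is all that the Least Squares theory requires.

```python
def design_row(x1, x2):
    """One row of X: known functions of the regressors, one column per coefficient."""
    return [1.0, x1, x1**2, x2, x2**2, x1 * x2]


def mean_response(beta, x1, x2):
    """Expected value of y at (x1, x2): an inner product, hence linear in beta."""
    return sum(b * z for b, z in zip(beta, design_row(x1, x2)))


# Two hypothetical coefficient vectors and their sum.
beta_a = [1.0, 0.5, -0.25, 2.0, 0.0, 1.5]
beta_b = [0.0, 1.0, 0.75, -1.0, 0.5, 0.5]
beta_sum = [a + b for a, b in zip(beta_a, beta_b)]

lhs = mean_response(beta_sum, 2.0, 3.0)
rhs = mean_response(beta_a, 2.0, 3.0) + mean_response(beta_b, 2.0, 3.0)
print(lhs, rhs)
```

A model in which a coefficient and its square both appear fails this additivity check, which is why it falls outside (28.59).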


28.14 With this understanding, we see that any linear regression analysis reduces
to the mathematical problem of inverting the matrix of sums of squares and products
of the regressors, X'X. The inverse is required both for estimation in (28.61) and
for estimating the dispersion matrix of the estimators from (28.62) and (28.63). No
new point of statistical interest arises.


Cochran (1938) and Kabe (1963) give formulae for adjusting an analysis when one 
or two of the original x-variables are omitted, or one or two new ones added. 

Hudson (1966) discusses generally the fitting by LS of segmented curves whose join 
points must be estimated. 


Orthogonal regression analyses 

28.15 It is evidently a convenience in carrying out a regression analysis if the
estimators $\hat\beta_i$ are uncorrelated: in fact, if the $\epsilon_j$ are normally distributed, so will the
$\hat\beta_i$ be, since they are linear functions of y, and lack of correlation will then imply
independence. A regression analysis with uncorrelated estimators is called orthogonal.
Since the regressors are not now random variables, but constants which are at choice
in experimental work, we may now ask a new type of question: can, and if so how
should, the elements of X be chosen so that the estimators $\hat\beta_i$ are uncorrelated?

This is a question arising in the theory of experimental design, and we defer a
detailed discussion of design problems to Volume 3. However, we observe from
(28.62) that if, and only if, $(X'X)^{-1}$ is diagonal, the analysis is orthogonal; and $(X'X)^{-1}$
is diagonal only if X'X is. Thus, to obtain an orthogonal analysis, we must choose
the elements of X so that X'X is diagonal. It follows at once that we must have

    $\sum_{j=1}^{n} x_{ij}x_{hj} = 0, \quad i \ne h.$    (28.66)

The diagonal elements of X'X are, of course, simply

    $(X'X)_{ii} = \sum_{j=1}^{n} x_{ij}^2,$

whence the corresponding inverse element is

    $[(X'X)^{-1}]_{ii} = 1 \Big/ \sum_{j=1}^{n} x_{ij}^2.$    (28.67)

(28.61) and (28.62) are then particularly simple.
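
As a small illustration of how (28.61) simplifies when (28.66) holds, the sketch below uses a hypothetical four-run design whose columns are a constant and two factors coded as plus or minus one; the design and the observations are assumptions made for the example. Each coefficient is then a single ratio, with no matrix inversion.

```python
from fractions import Fraction

# Hypothetical orthogonal design: rows are observations, columns are regressors.
X = [[1, -1, -1],
     [1,  1, -1],
     [1, -1,  1],
     [1,  1,  1]]
y = [Fraction(3), Fraction(5), Fraction(4), Fraction(10)]

columns = list(zip(*X))
k = len(columns)

# Off-diagonal elements of X'X: every cross-product sum vanishes, as (28.66) requires.
cross = [sum(a * b for a, b in zip(columns[i], columns[h]))
         for i in range(k) for h in range(k) if i != h]

# With X'X diagonal, each estimator is sum(x_ij y_j) / sum(x_ij^2), by (28.67).
beta_hat = [sum(x * v for x, v in zip(col, y)) / sum(x * x for x in col)
            for col in columns]
print(cross, beta_hat)
```

Dropping or adding a column that is orthogonal to the others leaves the remaining estimates unchanged, which is the practical appeal of an orthogonal analysis.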


Polynomial regression: orthogonal polynomials 
28.16 For a polynomial dependence of y upon x, as in (28.64), X'X cannot be
diagonal, since the off-diagonal elements will be sums of powers of a single variable x.
However, we can choose polynomials of degree i in x, say $\phi_i(x)$ (i = 0, 1, 2, ..., k),
which are mutually orthogonal. Then (28.64) is replaced by

    $y_j = \alpha_0\phi_0(x_j)+\alpha_1\phi_1(x_j)+\ldots+\alpha_k\phi_k(x_j)+\epsilon_j, \quad j = 1,2,\ldots,n,$    (28.68)

which we may write in matrix form $\mathbf{y} = \boldsymbol\Phi\boldsymbol\alpha+\boldsymbol\epsilon$.
The $\alpha$'s are a new set of parameters (functions of the original $\beta$'s), in terms of which
it is more convenient to work, for we now have, from (28.67),

    $[(\boldsymbol\Phi'\boldsymbol\Phi)^{-1}]_{ii} = 1 \Big/ \sum_{j=1}^{n} \phi_i^2(x_j),$    (28.69)

which we may use in (28.62). Furthermore, (28.61) becomes, using (28.69),

    $\hat\alpha_i = \sum_{j=1}^{n} y_j\phi_i(x_j) \Big/ \sum_{j=1}^{n} \phi_i^2(x_j).$    (28.70)

Thus each estimator $\hat\alpha_i$ depends only on the corresponding polynomial. This is
extremely convenient in “fitting” a polynomial regression whose degree is not
determined in advance: we increase the value of k, step by step, until a sufficiently good
“fit” is obtained. If we had used the non-orthogonal regression model (28.64), the
whole set of estimators $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_{k-1}$ would have had to be recalculated when a
further term, in $x^k$, was added to raise the degree of the polynomial. Of course, if
we reassemble the estimated regression $y = \sum_{i=0}^{k} \hat\alpha_i\phi_i(x)$ as $y = \sum_{i=0}^{k} \hat\beta_ix^i$, the $\hat\beta_i$ are
precisely those we should have obtained directly, though less conveniently, from (28.64).
This follows from the fact that both methods minimize the same sum of squared
residuals.
Using (28.70), (28.63) becomes in the orthogonal case

    $(n-k-1)s^2 = \sum_{j=1}^{n} y_j^2 - \sum_{i=0}^{k} \Big\{\sum_{j=1}^{n} y_j\phi_i(x_j)\Big\}^2 \Big/ \sum_{j=1}^{n} \phi_i^2(x_j)$    (28.71)

    $\phantom{(n-k-1)s^2} = \sum_{j=1}^{n} y_j^2 - \sum_{i=0}^{k} \hat\alpha_i^2 \sum_{j=1}^{n} \phi_i^2(x_j).$    (28.72)

These very simple expressions for the sum of squared residuals from the fitted
regression permit the rapid calculation of the additional reduction in residual variance brought
about by increasing the degree k of the fitted polynomial.
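
The equivalence between the orthogonal fit and the direct polynomial fit can be checked on a toy case. The sketch below is an illustration under stated assumptions: five equally spaced x-values, for which 1, x and x² - 2 are mutually orthogonal, and a made-up set of observations. It computes each coefficient from (28.70), reassembles the fit in powers of x, and confirms that the residuals satisfy the normal equations of the power-basis model.

```python
from fractions import Fraction

xs = [-2, -1, 0, 1, 2]
ys = [Fraction(3), Fraction(1), Fraction(2), Fraction(4), Fraction(9)]  # hypothetical data

# Orthogonal polynomials of degree 0, 1, 2 on these five points.
phis = [lambda x: Fraction(1), lambda x: Fraction(x), lambda x: Fraction(x * x - 2)]


def alpha_hat(i):
    """(28.70): the i-th coefficient uses only the i-th polynomial."""
    num = sum(y * phis[i](x) for x, y in zip(xs, ys))
    den = sum(phis[i](x) ** 2 for x in xs)
    return num / den


a = [alpha_hat(i) for i in range(3)]

# Reassemble a0 + a1 x + a2 (x^2 - 2) as b0 + b1 x + b2 x^2.
b = [a[0] - 2 * a[2], a[1], a[2]]

# Least Squares in the power basis requires residuals orthogonal to 1, x and x^2.
resid = [y - (b[0] + b[1] * x + b[2] * x * x) for x, y in zip(xs, ys)]
normal = [sum(e * x**p for x, e in zip(xs, resid)) for p in range(3)]
print(a, b, normal)
```

Raising the degree only appends a new coefficient to `a`; the earlier entries are untouched, whereas every entry of `b` may change.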


28.17 We now have to see how the orthogonal polynomials $\phi_i(x)$ are to be
evaluated. We require

    $\sum_{j=1}^{n} \phi_i(x_j)\,\phi_h(x_j) = 0, \quad i \ne h; \quad i, h = 0,1,\ldots,k,$    (28.73)

where

    $\phi_i(x) = \sum_{r=0}^{i} c_{ir}x^r.$    (28.74)

There are (i+1) coefficients $c_{ir}$ in (28.74), and hence in all the polynomials $\phi_i$ there
are $\sum_{i=0}^{k}(i+1) = \tfrac12(k+1)(k+2)$ coefficients to be determined. On these, (28.73)
imposes only $\tfrac12 k(k+1)$ constraints. We determine the excess (k+1) constants by
requiring, as is convenient, that $c_{ii} = 1$, all i. We then have at once from (28.74)

    $\phi_0(x) = c_{00} = 1$    (28.75)

identically in x. (28.73) is now just sufficient to determine the $c_{ir}$, apart from an
arbitrary constant multiplier, say $\lambda_{in}$, for each $\phi_i(x)$, i > 0. (28.73) and (28.74) give,
with h = k,

    $\sum_{j=1}^{n} \sum_{r=0}^{i} c_{ir}x_j^r \sum_{s=0}^{k} c_{ks}x_j^s = 0, \quad i < k,$

or

    $\sum_{r=0}^{i} c_{ir} \sum_{s=0}^{k} c_{ks}\,\mu'_{r+s} = 0, \quad i < k,$    (28.76)

where $\mu'_p$ is the pth moment of the set of x's. Since (28.76) holds for all i = 0, 1, 2,
..., k-1, we must have

    $\sum_{s=0}^{k} c_{ks}\,\mu'_{r+s} = 0, \quad r = 0,1,\ldots,k-1.$    (28.77)

Writing the determinant

    $|M_k| = \begin{vmatrix} \mu'_0 & \mu'_1 & \cdots & \mu'_k \\ \mu'_1 & \mu'_2 & \cdots & \mu'_{k+1} \\ \vdots & \vdots & & \vdots \\ \mu'_k & \mu'_{k+1} & \cdots & \mu'_{2k} \end{vmatrix}$

and $|M_k^{uv}|$ for the minor of the element in the uth row and vth column of $|M_k|$, the
solution of (28.77) is (remembering that $c_{kk} = 1$)

    $c_{ks} = (-1)^{k+s}\,|M_k^{k+1,s+1}| \big/ |M_k^{k+1,k+1}|, \quad s = 0,1,\ldots,k.$    (28.78)

Thus, from (28.74) and (28.78),

    $\phi_k(x) = \begin{vmatrix} \mu'_0 & \mu'_1 & \cdots & \mu'_k \\ \mu'_1 & \mu'_2 & \cdots & \mu'_{k+1} \\ \vdots & \vdots & & \vdots \\ \mu'_{k-1} & \mu'_k & \cdots & \mu'_{2k-1} \\ 1 & x & \cdots & x^k \end{vmatrix} \Big/ |M_{k-1}|.$    (28.79)

(28.79) is used to evaluate the polynomial for any k. Of course, $\mu'_0 = 1$, and we
simplify by measuring from the mean of the x's, so that $\mu'_1 = 0$ and we may drop
the primes. It will be observed that the determinant in the denominator of (28.79)
is simply that in the numerator with its last row and column deleted.

We find, for example,

    $\phi_1(x) = \begin{vmatrix} 1 & 0 \\ 1 & x \end{vmatrix} \Big/ |1| = x,$    (28.80)

    $\phi_2(x) = \begin{vmatrix} 1 & 0 & \mu_2 \\ 0 & \mu_2 & \mu_3 \\ 1 & x & x^2 \end{vmatrix} \Big/ \begin{vmatrix} 1 & 0 \\ 0 & \mu_2 \end{vmatrix} = x^2 - \frac{\mu_3}{\mu_2}\,x - \mu_2,$    (28.81)

and so on. A simpler recursive method of obtaining the polynomials is given in
Exercise 28.23.
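
The determinantal form (28.79) translates directly into code by expanding the numerator along its last row, so that the coefficient of each power of x is a signed minor divided by the leading determinant. The sketch below is one way of doing this in exact arithmetic; the function names `det` and `orthogonal_poly` are hypothetical, and the moments are taken about zero exactly as in (28.76).

```python
from fractions import Fraction


def det(A):
    """Determinant by Gaussian elimination in exact rational arithmetic."""
    A = [[Fraction(v) for v in row] for row in A]
    n, d = len(A), Fraction(1)
    for c in range(n):
        p = next((r for r in range(c, n) if A[r][c] != 0), None)
        if p is None:
            return Fraction(0)
        if p != c:
            A[c], A[p] = A[p], A[c]
            d = -d                      # a row swap changes the sign
        d *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return d


def orthogonal_poly(xs, k):
    """Coefficients of x^0, ..., x^k in phi_k(x), from (28.79), leading coefficient 1."""
    n = len(xs)
    mu = [sum(Fraction(x) ** p for x in xs) / n for p in range(2 * k)]
    top = [[mu[r + s] for s in range(k + 1)] for r in range(k)]   # the k moment rows
    denom = det([row[:k] for row in top])                         # last column deleted
    coeffs = []
    for s in range(k + 1):
        minor = det([row[:s] + row[s + 1:] for row in top])       # delete column s
        coeffs.append((-1) ** (k + s) * minor / denom)            # cofactor of x^s
    return coeffs


xs = [-2, -1, 0, 1, 2]
print(orthogonal_poly(xs, 2), orthogonal_poly(xs, 3))
```

For these five symmetric points the quadratic comes out as x² - 2 and the cubic as x³ - (17/5)x. For large k the determinants become ill-conditioned in floating point, which is one reason a recursive construction is usually preferred in practice.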


The case of equally-spaced «-values 


28.18 The most important applications of orthogonal polynomials in regression
analysis are to situations where the regressor variable, x, takes values at equal intervals.
This is often the case with observations taken at successive times, and with data grouped
into classes of equal width. If we have n such equally-spaced values of x, we measure
from their mean and take the natural interval as unit, thus obtaining as working values
of x: $-\tfrac12(n-1), -\tfrac12(n-3), -\tfrac12(n-5), \ldots, \tfrac12(n-3), \tfrac12(n-1)$. For this simple case,
the values of the moments in (28.79) can be explicitly calculated: in fact, apart from
the mean which has been taken as origin, these are the moments of the first n natural
numbers, obtainable from the cumulants given in Exercise 3.23. The odd moments
are zero by symmetry; the even moments are

    $\mu_2 = (n^2-1)/12,$
    $\mu_4 = \mu_2(3n^2-7)/20,$
    $\mu_6 = \mu_2(3n^4-18n^2+31)/112,$

and so on. Substituting these and higher moments into (28.79), we obtain for the
first six polynomials

    $\phi_0(x) = 1,$
    $\phi_1(x) = \lambda_{1n}\,x,$
    $\phi_2(x) = \lambda_{2n}\{x^2 - \tfrac{1}{12}(n^2-1)\},$
    $\phi_3(x) = \lambda_{3n}\{x^3 - \tfrac{1}{20}(3n^2-7)x\},$
    $\phi_4(x) = \lambda_{4n}\{x^4 - \tfrac{1}{14}(3n^2-13)x^2 + \tfrac{3}{560}(n^2-1)(n^2-9)\},$    (28.82)
    $\phi_5(x) = \lambda_{5n}\{x^5 - \tfrac{5}{18}(n^2-7)x^3 + \tfrac{1}{1008}(15n^4-230n^2+407)x\},$
    $\phi_6(x) = \lambda_{6n}\{x^6 - \tfrac{5}{44}(3n^2-31)x^4 + \tfrac{1}{176}(5n^4-110n^2+329)x^2 - \tfrac{5}{14784}(n^2-1)(n^2-9)(n^2-25)\}.$

Allan (1930) also gives $\phi_i(x)$ for i = 7, 8, 9, 10. Following Fisher (1921b), the
arbitrary constants $\lambda_{in}$ in (28.82), referred to below (28.75), are determined conveniently
so that $\phi_i(x_j)$ is an integer for all j = 1, 2, ..., n. It will be observed that

    $\phi_{2i}(x) = \phi_{2i}(-x)$ and $\phi_{2i-1}(x) = -\phi_{2i-1}(-x);$

even-degree polynomials are even functions and odd-degree polynomials odd functions.
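
The closed forms in (28.82) are easy to tabulate. The sketch below evaluates the polynomials up to the fourth degree for n = 13, using the multipliers 1, 1, 1/6 and 7/12 that make every tabulated value an integer for that n; the function name `phi` and its `lam` argument are assumptions made for the illustration.

```python
from fractions import Fraction


def phi(i, x, n, lam=1):
    """phi_i(x), i = 0..4, from (28.82), for n equally spaced values centred at zero."""
    x, n2 = Fraction(x), n * n
    bodies = [
        Fraction(1),
        x,
        x**2 - Fraction(n2 - 1, 12),
        x**3 - Fraction(3 * n2 - 7, 20) * x,
        x**4 - Fraction(3 * n2 - 13, 14) * x**2 + Fraction(3, 560) * (n2 - 1) * (n2 - 9),
    ]
    return lam * bodies[i]


n = 13
xs = range(-6, 7)
lams = [1, 1, 1, Fraction(1, 6), Fraction(7, 12)]    # multipliers for n = 13
table = [[phi(i, x, n, lams[i]) for x in xs] for i in range(5)]
sums_of_squares = [sum(v * v for v in row) for row in table]
print(table[4], sums_of_squares)
```

The rows of `table` are the columns one would read from a published table of orthogonal polynomials for thirteen points, and `sums_of_squares` supplies the divisors needed in (28.70).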


Tables of orthogonal polynomials 
28.19 The Biometrika Tables give $\phi_i(x_j)$ for all j, n = 3(1)52 and i = 1(1)min(6, n-1),
together with the values of $\lambda_{in}$ and $\sum_{j=1}^{n} \phi_i^2(x_j)$.

Fisher and Yates' Tables give $\phi_i(x_j)$ (their $\xi'_i$), $\lambda_i$, and $\sum_{j=1}^{n} \phi_i^2(x_j)$ for all j, n = 3(1)75
and i = 1(1)min(5, n-1).

The Biometrika Tables give references to more extensive tabulations, ranging to
i = 9, n = 52, by van der Reyden, and to i = 5, n = 104, by Anderson and Houseman.

28.20 There is a large literature on orthogonal polynomials. For theoretical
details, the reader should refer to the paper by Fisher (1921b) which first applied them
to polynomial regression, to a paper by Allan (1930), and three papers by Aitken (1933).
More recently, Rushton (1951) discussed the case of unequally-spaced x-values, and
C. P. Cox (1958) gave a concise determinantal derivation of general orthogonal
polynomials, while Guest (1954, 1956) has considered grouping problems.

We shall content ourselves here with a single example of fitting orthogonal
polynomials in the equally-spaced case.


Example 28.3 

The first two columns of Table 28.1 show the human population of England and
Wales at the decennial Censuses from 1811 to 1931. These observations are clearly
not uncorrelated, so that the regression model (28.64) is not strictly appropriate, but
we carry through the fitting process for purely illustrative purposes.

Table 28.1

          Population    (Year - 1871)/10
  Year    (millions)    = x = φ1(x)         φ2(x)     φ3(x)     φ4(x)
              y
  1811      10.16           -6                22       -11        99
  1821      12.00           -5                11         0       -66
  1831      13.90           -4                 2         6       -96
  1841      15.91           -3                -5         8       -54
  1851      17.93           -2               -10         7        11
  1861      20.07           -1               -13         4        64
  1871      22.71            0               -14         0        84
  1881      25.97            1               -13        -4        64
  1891      29.00            2               -10        -7        11
  1901      32.53            3                -5        -8       -54
  1911      36.07            4                 2        -6       -96
  1921      37.89            5                11         0       -66
  1931      39.95            6                22        11        99

  Σ y_j    314.09
  λ_in                       1                 1       1/6      7/12
  Σ φ_i²(x_j)              182             2,002       572    68,068

Here n = 13, and from the Biometrika Tables, Table 47, we read off the values in
the last four columns of Table 28.1. From that Table, we have

    $\sum y_j\phi_0(x_j) = \sum y_j = 314.09,$
    $\sum y_j\phi_1(x_j) = 474.77,$
    $\sum y_j\phi_2(x_j) = 123.19,$
    $\sum y_j\phi_3(x_j) = -39.38,$
    $\sum y_j\phi_4(x_j) = -374.30.$

Hence, using (28.70),

    $\hat\alpha_0 = 314.09/13 = 24.1608,$
    $\hat\alpha_1 = 474.77/182 = 2.60863,$
    $\hat\alpha_2 = 123.19/2002 = 0.0615335,$
    $\hat\alpha_3 = -39.38/572 = -0.0688462,$
    $\hat\alpha_4 = -374.30/68{,}068 = -0.00549891.$

For the estimated fourth-degree orthogonal polynomial regression of y on x, we then
have, using (28.68) and (28.82),

    $y = 24.1608 + 2.60863\,x + 0.0615335\,(x^2-14) - 0.0688462\,\{\tfrac16(x^3-25x)\} - 0.00549891\,\{\tfrac{7}{12}(x^4-\tfrac{247}{7}x^2+144)\}.$

If we collected the terms on the right so that we had

    $y = \hat\beta_0+\hat\beta_1x+\hat\beta_2x^2+\hat\beta_3x^3+\hat\beta_4x^4,$

the coefficients $\hat\beta_i$ would be exactly those we should have obtained if we had used
(28.64) instead of the orthogonal form (28.68). The advantage of the latter, apart
from its computational simplicity, is that we can simply examine the improvement in
“fit” of the regression equation as its degree increases. We require only the
calculation of

    $\sum_j y_j^2 = 8{,}839.939,$

and we may substitute the quantities already calculated into (28.72) for this purpose.
Thus we have:

  Total sum of squares                                                    8,839.939
  Reduction due to α̂0 = α̂0² Σφ0² = (24.1608)² × 13                =      7,588.656
                                                          Residual:       1,251.283
  Reduction due to α̂1 = α̂1² Σφ1² = (2.60863)² × 182               =      1,238.497
                                                          Residual:          12.786
  Reduction due to α̂2 = α̂2² Σφ2² = (0.0615335)² × 2,002           =          7.580
                                                          Residual:           5.206
  Reduction due to α̂3 = α̂3² Σφ3² = (0.0688462)² × 572             =          2.711
                                                          Residual:           2.495
  Reduction due to α̂4 = α̂4² Σφ4² = (0.00549891)² × 68,068         =          2.058
                                                          Residual:           0.437


Evidently, the cubic and quartic expressions are good “fits”: they are displayed
in Fig. 28.1.

The reader should not need to be warned against the dangers of extrapolating
from a fitted regression, however close, which has no theoretical basis. In this case,
for example, he can satisfy himself visually that the value “predicted” by the quartic
regression for 1951 (x = 8) is a good deal less than the Census population of 43.7
millions actually found in that year.

[Fig. 28.1: Cubic (full line) and quartic (broken line) polynomials fitted to the data
of Table 28.1. Horizontal axis: Years; vertical axis: Population (millions).]
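
The arithmetic of Example 28.3 can be reproduced in a few lines. The sketch below takes the populations and the tabulated polynomial values for n = 13, computes each coefficient from (28.70), and subtracts the successive reductions of (28.72) from the total sum of squares; the variable names and the helper `quartic` are assumptions made for the illustration.

```python
y = [10.16, 12.00, 13.90, 15.91, 17.93, 20.07, 22.71,
     25.97, 29.00, 32.53, 36.07, 37.89, 39.95]
phi = [
    [1] * 13,
    list(range(-6, 7)),
    [22, 11, 2, -5, -10, -13, -14, -13, -10, -5, 2, 11, 22],
    [-11, 0, 6, 8, 7, 4, 0, -4, -7, -8, -6, 0, 11],
    [99, -66, -96, -54, 11, 64, 84, 64, 11, -54, -96, -66, 99],
]

alpha, residuals = [], []
rss = sum(v * v for v in y)                      # total sum of squares
for column in phi:
    num = sum(v * p for v, p in zip(y, column))
    den = sum(p * p for p in column)
    alpha.append(num / den)                      # (28.70)
    rss -= alpha[-1] ** 2 * den                  # one reduction term of (28.72)
    residuals.append(rss)


def quartic(x):
    """Fitted quartic at x = (Year - 1871)/10, using the n = 13 polynomials."""
    basis = [1, x, x**2 - 14, (x**3 - 25 * x) / 6,
             7 * (x**4 - 247 * x**2 / 7 + 144) / 12]
    return sum(a * b for a, b in zip(alpha, basis))


print([round(a, 7) for a in alpha])
print([round(r, 3) for r in residuals])
print(round(quartic(8), 2))                      # extrapolation to 1951
```

The residuals agree with the table above to within rounding of the intermediate figures, and the extrapolated value for 1951 falls well short of the 43.7 millions quoted in the text.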


Confidence intervals and tests for the parameters of the linear model 

28.21 In 28.12 we discussed the point estimation of the parameters $\boldsymbol\beta$, $\sigma^2$ of the
general linear regression model (28.59). If we now assume $\boldsymbol\epsilon$ to be a vector of normal
error variables, as we shall do for the remainder of this chapter, we may set confidence
intervals for (and correspondingly test hypotheses concerning) any component of the
parameter vector $\boldsymbol\beta$. These are all linear hypotheses in the sense of Chapter 24 and
the tests are all LR tests.

Any estimator $\hat\beta_i$ is a linear function of the $y_j$ and is therefore normally distributed
with mean $\beta_i$ and variance, from (28.62),

    $\mathrm{var}(\hat\beta_i) = \sigma^2[(X'X)^{-1}]_{ii}.$    (28.83)

(If the analysis is orthogonal, (28.67) is used in (28.83).) From 19.11, $s^2$, the
estimator of $\sigma^2$ defined at (28.63), is distributed independently of $\hat{\boldsymbol\beta}$ (and hence of any
component of $\hat{\boldsymbol\beta}$), the distribution of $(n-k)s^2/\sigma^2$ being of the $\chi^2$ form with $\nu = (n-k)$
degrees of freedom. It follows immediately that the statistic

    $t = (\hat\beta_i-\beta_i) \big/ \{s^2[(X'X)^{-1}]_{ii}\}^{1/2},$    (28.84)

being the ratio of a standardized normal variate to the square root of an independent
$\chi^2/\nu$ variate, has a “Student's” t-distribution with $\nu = (n-k)$ degrees of freedom.
This enables us to set confidence intervals for $\beta_i$ or to test hypotheses concerning its
value. The central confidence interval with coefficient $(1-\alpha)$ is simply

    $\hat\beta_i \pm t_{1-\frac12\alpha}\{s^2[(X'X)^{-1}]_{ii}\}^{1/2},$    (28.85)

where $t_{1-\frac12\alpha}$ is the value of “Student's” t for $\nu$ degrees of freedom for which its
distribution function

    $F(t_{1-\frac12\alpha}) = 1-\tfrac12\alpha.$

Since we are here testing a linear hypothesis, the test based on (28.84) is a special
case of the general variance-ratio F-test for the linear hypothesis given in 24.28: here
we have only one constraint, and the F-test reduces to a $t^2$ test, corresponding to the
central confidence interval (28.85).


Confidence intervals for an expected value of y 


28.22 Suppose that, having fitted a linear regression model to n observations,
we wish to estimate the expected value of y corresponding to a given value for each of
the k regressors $x_1,\ldots,x_k$. If we write these given values as a (k × 1) vector $\mathbf{x}^0$, we
have at once from 19.6 that the minimum variance unbiassed estimator of the expected
value of y for given $\mathbf{x}^0$ is

    $\hat y = (\mathbf{x}^0)'\hat{\boldsymbol\beta},$    (28.86)

and that its variance is, by 19.6 and (28.62),

    $\mathrm{var}\,\hat y = (\mathbf{x}^0)'V(\hat{\boldsymbol\beta})\,\mathbf{x}^0 = \sigma^2(\mathbf{x}^0)'(X'X)^{-1}\mathbf{x}^0.$    (28.87)

Just as in 28.21, we estimate the sampling variance (28.87) by inserting $s^2$ for $\sigma^2$,
and set confidence limits from “Student's” t-distribution, which here applies to the
statistic

    $t = \{\hat y - E(y \mid \mathbf{x}^0)\} \big/ \{s^2(\mathbf{x}^0)'(X'X)^{-1}\mathbf{x}^0\}^{1/2}$    (28.88)

with $\nu = (n-k)$ as before.


Confidence intervals for the expectation of a further value of y: prediction intervals 

28.23 The results of 28.22 may be applied to obtain a confidence interval for
the expectation of a further ((n+1)th) value of y, $y_{n+1}$, not taken into account in fitting
the regression model. If $\mathbf{x}^0$ represents the given values of the regressors for which
$y_{n+1}$ is to be observed, (28.86) gives us the unbiassed estimator

    $\hat y_{n+1} = (\mathbf{x}^0)'\hat{\boldsymbol\beta}$    (28.89)

just as before, but the fact that $y_{n+1}$ will have variance $\sigma^2$ about its expectation increases
its sampling variance over (28.87) by that amount, giving us

    $\mathrm{var}\,\hat y_{n+1} = \sigma^2\{(\mathbf{x}^0)'(X'X)^{-1}\mathbf{x}^0+1\},$    (28.90)

which we estimate, putting $s^2$ for $\sigma^2$ as before, to obtain the “Student's” variate

    $t = \{\hat y_{n+1} - E(y_{n+1} \mid \mathbf{x}^0)\} \big/ [s^2\{(\mathbf{x}^0)'(X'X)^{-1}\mathbf{x}^0+1\}]^{1/2},$    (28.91)

again with $\nu = (n-k)$, from which to set our confidence intervals.

Similarly, if a set of N further observations are to be made on y at the same $\mathbf{x}^0$,
(28.89)-(28.91) hold for the estimation of the mean $\bar y_N$ to be observed, with the obvious
adjustment that the unit in the braces in (28.90) and (28.91) is replaced by 1/N, the
additional variance now being $\sigma^2/N$.

Confidence intervals for further values, such as those discussed in this section,
are sometimes called prediction intervals; it must always be borne in mind that these
“predictions” are conditional upon the assumption that the linear model fitted to
the previous n observations is valid for the further observations too, i.e. that there is
no structural change in the model.

In the simple case

    $y_j = \beta_1+\beta_2x_j+\epsilon_j, \quad j = 1,2,\ldots,n,$    (28.92)

we have seen in Examples 19.3, 19.6, that

    $\hat\beta_2 = \sum_j (x_j-\bar x)(y_j-\bar y) \Big/ \sum_j (x_j-\bar x)^2,$
    $\hat\beta_1 = \bar y-\hat\beta_2\bar x,$
    $s^2 = \frac{1}{n-2} \sum_j \{y_j-(\hat\beta_1+\hat\beta_2x_j)\}^2,$

and

    $(X'X)^{-1} = \frac{1}{\sum_j (x_j-\bar x)^2} \begin{pmatrix} \sum_j x_j^2/n & -\bar x \\ -\bar x & 1 \end{pmatrix}.$

Here $\mathbf{x}^0$ is the two-component vector $\begin{pmatrix} 1 \\ x^0 \end{pmatrix}$ and we may proceed to set confidence
intervals for $\beta_1$, $\beta_2$, $E(y \mid \mathbf{x}^0)$ and $E(y_{n+1} \mid \mathbf{x}^0)$, using (28.84), (28.88) and (28.91); in
each case we have a “Student's” variate with (n-2) degrees of freedom.


(a) It will be noticed that the analysis is orthogonal if and only if $\bar{x} = 0$, so that
in this case we need only make a change of origin in $x$ to obtain orthogonality. Also,
the variances of the estimators (the diagonal elements of their dispersion matrix) are
minimized when $\bar{x} = 0$ and $\sum x^2$ is as large as possible. Both orthogonality and
minimized sampling variances are therefore achieved if we choose the $x_j$ so that (assuming
$n$ to be even)

$$x_1, x_2, \ldots, x_{n/2} = +a,$$
$$x_{n/2+1}, x_{n/2+2}, \ldots, x_n = -a,$$
and $a$ is as large as possible. This corresponds to the intuitively obvious fact that if
we are certain that the dependence of $y$ upon $x$ is linear with constant variance, we can
most efficiently "fix" the line at its end-points. However, if the dependence were
non-linear, we should be unable to detect this if all our observations had been made
at two values of $x$ only, and it is therefore usual to spread the $x$-values more evenly
over its range; it is always as well to be able to check the structural assumptions of
our model in the course of the analysis.
(b) Our confidence interval in this case for $E(y \mid x^0)$ is, from (28.88),

$$(\hat\beta_1 + \hat\beta_2x^0) \pm t_{1-\alpha/2}\,s\left\{\frac{1}{n} + \frac{(x^0-\bar{x})^2}{\sum(x_j-\bar{x})^2}\right\}^{1/2}. \qquad (28.93)$$

If we consider this as a function of the value $x^0$, we see that (28.93) defines the two
branches of a hyperbola of which the fitted regression $(\hat\beta_1 + \hat\beta_2x^0)$ is a diameter. The
confidence interval obviously has minimum length when $x^0 = \bar{x}$, the observed mean,
and its length increases steadily as $|x^0 - \bar{x}|$ increases, confirming the intuitive notion
that we can estimate most accurately near the "centre" of the observed values of $x$.
Fig. 28.2 illustrates the loci of the confidence limits given by (28.93).
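The design remark in part (a) can be checked numerically. The sketch below compares the sampling variance of the slope estimator, $\sigma^2/\sum(x_j-\bar{x})^2$, for a two-point design at $\pm a$ and for an evenly spaced design over the same range; the function name `slope_variance` and the particular values of $n$ and $a$ are illustrative assumptions.

```python
def slope_variance(xs, sigma2=1.0):
    """Sampling variance of the Least Squares slope: sigma^2 / sum (x - xbar)^2."""
    n = len(xs)
    xbar = sum(xs) / n
    return sigma2 / sum((x - xbar) ** 2 for x in xs)

n, a = 10, 1.0
# Half the observations at +a and half at -a, as in part (a).
two_point = [a] * (n // 2) + [-a] * (n // 2)
# The same range covered by n equally spaced values.
evenly = [-a + 2.0 * a * i / (n - 1) for i in range(n)]

var_two_point = slope_variance(two_point)
var_evenly = slope_variance(evenly)
print(var_two_point, var_evenly)
```

Both designs have $\bar{x} = 0$ and are therefore orthogonal, but the evenly spaced design pays for its ability to reveal non-linearity with a larger slope variance.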


Robison (1964) gives ML estimates and confidence intervals for the intersection 
abscissa of two polynomial regressions, and a bibliography of related work. 


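[Figure: the fitted regression line, with the upper and lower confidence limits for $E(y \mid x^0)$
forming the two branches of a hyperbola about it; the vertical segment between the branches at
$x^0$ is the confidence interval for $y$ given $x^0$. Horizontal axis: values of $x$, with the
observed mean $\bar{x}$ marked.]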

Fig. 28.2—Hyperbolic loci of confidence limits (28.93) for an expected value of y in 
simple linear regression 

28.24 The confidence limits for an expected value of $y$ discussed in Example 28.4(b),
and more generally in 28.22, refer to the value of $y$ corresponding to a particular $x^0$;
in Fig. 28.2, any particular confidence interval is given by that part of the vertical
line through $x^0$ lying between the branches of the hyperbola. Suppose now that we
require a confidence region for an entire regression line, i.e. a region $R$ in the $(x, y)$ plane
(or, more generally, in the $(\mathbf{x}, y)$ space) such that there is probability $1-\alpha$ that the
true regression line $y = \mathbf{x}'\boldsymbol\beta$ is contained in $R$. This, it will be seen, is a quite distinct
problem from that just discussed; we are now seeking a confidence region, not an
interval, and it covers the whole line, not one point on the line. We now consider this
problem, first solved in the simplest case by Working and Hotelling (1929) in a remarkable
paper; our discussion follows that of Hoel (1951).


Confidence regions for a regression line 
28.25 We first treat the simple case of Example 28.4 and assume $\sigma^2$ known,
restrictions to be relaxed in 28.31-2. For convenience, we measure the $x_j$ from their
mean, so that $\bar{x} = 0$ and, from Example 28.4(a), the analysis is orthogonal. We then
have, from the dispersion matrix, $\operatorname{var}\hat\beta_1 = \sigma^2/n$, $\operatorname{var}\hat\beta_2 = \sigma^2/\sum x^2$, and $\hat\beta_1$ and $\hat\beta_2$ are
normally and independently distributed. Thus
$$u = n^{1/2}(\beta_1 - \hat\beta_1)/\sigma, \quad v = (\textstyle\sum x^2)^{1/2}(\beta_2 - \hat\beta_2)/\sigma, \qquad (28.94)$$
are independent standardized normal variates.
Let $g(u^2, v^2)$ be a single-valued even function of $u$ and $v$, and let
$$g(u^2, v^2) = g_{1-\alpha} \qquad (28.95)$$
define a family of closed curves in the $(u, v)$ plane such that (a) whenever $g_{1-\alpha}$ decreases,
the new curve is contained inside that corresponding to the larger value of $1-\alpha$; and
(b) every interior point of a curve lies on some other curve. To the implicit relation
(28.95) between $u$ and $v$, we assume that there corresponds an explicit relation
$$u^2 = p(v^2)$$
or
$$u = \pm h(v). \qquad (28.96)$$
We further assume that $h'(v) = dh(v)/dv$ exists for all $v$ and is a monotone decreasing
function of $v$ taking all real values.


28.26 We see from (28.94) that for any given set of observations to which a regression
has been fitted, there will correspond to the true regression line,
$$y = \beta_1 + \beta_2x, \qquad (28.97)$$
values of $u$ and $v$ such that
$$y = \left(\hat\beta_1 + \frac{\sigma}{n^{1/2}}u\right) + \left(\hat\beta_2 + \frac{\sigma}{(\sum x^2)^{1/2}}v\right)x. \qquad (28.98)$$
Substituting (28.96) into (28.98), we have two families of regression lines, with $v$ as
parameter,
$$y = \left(\hat\beta_1 \pm \frac{\sigma}{n^{1/2}}h(v)\right) + \left(\hat\beta_2 + \frac{\sigma}{(\sum x^2)^{1/2}}v\right)x, \qquad (28.99)$$
one family corresponding to each sign in (28.96). We now find the envelopes of these
families.
Differentiating (28.99) with respect to $v$ and equating the derivative to zero, we
obtain
$$x = \mp\left(\frac{\sum x^2}{n}\right)^{1/2}h'(v). \qquad (28.100)$$
Substituted into (28.99), (28.100) gives the required envelopes:
$$(\hat\beta_1 + \hat\beta_2x) \pm \frac{\sigma}{n^{1/2}}\{h(v) - vh'(v)\}, \qquad (28.101)$$
where the functions of $v$ are to be substituted for in terms of $x$ from (28.100). The
restrictions placed on $h'(v)$ below (28.96) ensure that the two envelopes in (28.101)
exist for all $x$, are single-valued, and that all members of each family lie on one side
only of its envelope. In fact, the curve given by taking the upper signs in (28.101) always
lies above the curve obtained by taking the lower signs in (28.101), and all members
of the two families (28.99) lie between them.
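The envelope property can be checked numerically for one particular choice of $h$. The sketch below assumes, purely for illustration, the circular form $h(v) = (a^2 - v^2)^{1/2}$, for which eliminating $v$ between (28.100) and (28.101) gives the envelope offsets $\pm a\sigma(1/n + x^2/\sum x^2)^{1/2}$ from the fitted line. The function names and the values of $a$, $\sigma$, $n$ and $\sum x^2$ are hypothetical.

```python
import math

def line_offset(v, x, a, sigma, n, sxx, sign=1):
    """Offset from the fitted line, at abscissa x, of the member of family
    (28.99) with parameter v, assuming h(v) = sqrt(a^2 - v^2)."""
    h = math.sqrt(max(0.0, a * a - v * v))
    return sign * sigma * h / math.sqrt(n) + sigma * v * x / math.sqrt(sxx)

def envelope_offset(x, a, sigma, n, sxx):
    """Upper-envelope offset for the same h(v): a*sigma*sqrt(1/n + x^2/sxx)."""
    return a * sigma * math.sqrt(1.0 / n + x * x / sxx)

a, sigma, n, sxx = 2.45, 1.0, 10, 10.0
vs = [-a + 2.0 * a * i / 2000 for i in range(2001)]  # grid over -a <= v <= a

inside = True     # do all lines stay between the envelopes?
worst_gap = 0.0   # how closely does the grid of lines approach each envelope?
for k in range(-30, 31):
    x = k / 10.0
    env = envelope_offset(x, a, sigma, n, sxx)
    top = max(line_offset(v, x, a, sigma, n, sxx, +1) for v in vs)
    bottom = min(line_offset(v, x, a, sigma, n, sxx, -1) for v in vs)
    inside = inside and top <= env + 1e-12 and bottom >= -env - 1e-12
    worst_gap = max(worst_gap, env - top, bottom + env)
print(inside, worst_gap)
```

Every line of either family stays between the two curves, and at each $x$ some member of the upper family comes arbitrarily close to the upper curve, which is what it means for the curve to be an envelope.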


28.27 Any pair of values $(u, v)$ for which
$$g(u^2, v^2) < g_{1-\alpha} \qquad (28.102)$$
will correspond to a regression line lying between the pair of envelopes (28.101), because
for any fixed $v$, $u^2 = \{h(v)\}^2$ will be reduced, so that the constant term in (28.99)
will be reduced in magnitude as a function of $v$, while the coefficient of $x$ is unchanged.
Thus if $u$ and $v$ satisfy (28.102), the true regression line will lie between the pair of
envelopes (28.101). Now choose $g_{1-\alpha}$ so that the continuous random variable $g(u^2, v^2)$
satisfies
$$P\{g(u^2, v^2) < g_{1-\alpha}\} = 1-\alpha. \qquad (28.103)$$
Then we have probability $1-\alpha$ that (28.102) holds, and the region $R$ between the pair
of envelopes (28.101) is a confidence region for the true regression line with confidence
coefficient $1-\alpha$.


28.28 We now have to consider how to choose the function $g(u^2, v^2)$ so that, for
fixed $1-\alpha$, the confidence region $R$ is in some sense as small as possible. We cannot
simply minimize the area of $R$, since its area is always infinite. We therefore introduce
a weight function $w(x)$ and choose $R$ to minimize the integral
$$I = \int_{-\infty}^{\infty}(y_2 - y_1)\,w(x)\,dx, \qquad (28.104)$$
where $y_1$, $y_2$ are respectively the lower and upper envelopes (28.101), the boundaries
of $R$, and $\int_{-\infty}^{\infty}w(x)\,dx = 1$. We may rewrite (28.104)
$$I = E(y_2) - E(y_1), \qquad (28.105)$$
expectations being with respect to $w(x)$.

Obviously, the optimum $R$ resulting from the minimization will depend on the
weight function chosen. Putting $S^2 = \sum x^2/n$, consider the normal weight-function
$$w(x) = (2\pi S^2)^{-1/2}\exp\left(-\frac{x^2}{2S^2}\right), \qquad (28.106)$$
which is particularly appropriate if the values of $x$, here regarded as fixed, are in fact
sampled from a normal distribution, e.g. if $x$ and $y$ are bivariate normally distributed.
Putting (28.101) and (28.106) into (28.105), it becomes
$$I = \frac{2\sigma}{n^{1/2}}\left[E\{h(v)\} - E\{vh'(v)\}\right]. \qquad (28.107)$$
From (28.100) we have, since $h'(v)$ is decreasing,
$$dx = -Sh''(v)\,dv, \qquad (28.108)$$
so that if we transform the integrals in (28.107) to the variable $v$, we find
$$E\{h\} = -(2\pi)^{-1/2}\int hh''\exp\{-\tfrac12(h')^2\}\,dv,$$
$$E\{vh'\} = -(2\pi)^{-1/2}\int vh'h''\exp\{-\tfrac12(h')^2\}\,dv, \qquad (28.109)$$
the integration in each case being over the whole range of $v$. Since $h(v)$ is an even
function, both the integrals need be taken for positive $v$ only, and (28.109) gives, in
(28.107),
$$I = -\frac{4\sigma}{(2\pi n)^{1/2}}\int_0^{v_{\max}}h''(h - vh')\exp\{-\tfrac12(h')^2\}\,dv. \qquad (28.110)$$
This is to be minimized, subject to (28.103), which by the independence, normality
and symmetry of the distributions of $u$ and $v$ is equivalent to the condition that
$$(2\pi)^{-1}\int_0^{v_{\max}}\left\{\int_0^{h(v)}\exp(-\tfrac12u^2)\,du\right\}\exp(-\tfrac12v^2)\,dv = \tfrac14(1-\alpha). \qquad (28.111)$$
It must also be remembered that we have required $h'(v)$ to be a monotone decreasing
function of $v$ taking all real values.


28.29 So that we can proceed effectively to the minimization of $I$, we choose a
general form for $g(u^2, v^2)$, and here, too, there is a "natural" choice, the family of
ellipses, which we write
$$h(v) = b(a^2 - v^2)^{1/2}, \qquad (28.112)$$
and we now have to minimize (28.110) for variation in $a$, subject to (28.111). Since
(28.112) gives
$$h'(v) = -b^2v/h(v), \quad h''(v) = -b^2[\{h(v)\}^2 + b^2v^2]/\{h(v)\}^3, \qquad (28.113)$$
we therefore have to minimize (dropping the constant)
$$J = b^2\int_0^a\left(\frac{\{h(v)\}^2 + b^2v^2}{\{h(v)\}^2}\right)^2\exp\left[-\tfrac12\{b^2v/h(v)\}^2\right]dv \qquad (28.114)$$
for a choice of $a$, subject to
$$\int_0^a\left\{\int_0^{b(a^2-v^2)^{1/2}}\exp(-\tfrac12u^2)\,du\right\}\exp(-\tfrac12v^2)\,dv = \tfrac12\pi(1-\alpha). \qquad (28.115)$$
By differentiating (28.114) with respect to a, and replacing db/da in that derivative 
by its value obtained by differentiating (28.115), we find, after some reduction, 


dj = b® exp (4b?) { secs 
d 0 


—_— (1 — 27)? 
[ exp {— 2b?/(1 — t?)} dt f. exp {—3a?(1— 5?) t?} dt 
SS eee ee (28.116) 
| (1-#)exp{—Ja°(1—b) 2} de 
and for a minimum, we put $dJ/da = 0$ and solve for $a$, and thence $b$.


Hoel (1951) has carried this through for $1-\alpha = 0.95$, and found the ellipse (28.112)
to have semi-axes of 2.62 and 2.32, not very far from equal. If we specialize the
ellipse to a circle by putting $b = 1$ in (28.112), we find the radius $a$ to be 2.45 for
$1-\alpha = 0.95$. Hoel found in this case that the value of $J$ was less than one per cent
larger than the minimum.


28.30 The choice of a circle for $g(u^2, v^2)$ corresponds to the original solution of
this problem by Working and Hotelling (1929), who derived it simply by observing
that, since $u$ and $v$ in (28.94) are independent standardized normal variates, $u^2 + v^2$ is
a $\chi^2$ variate with 2 degrees of freedom, and $a^2$ ($= g_{1-\alpha}$ in (28.103)) is simply the
$100(1-\alpha)$ per cent point obtained from the tables of that distribution. The boundaries
of the confidence region are, putting (28.112) and (28.113) with $b = 1$ into (28.101),
$$(\hat\beta_1 + \hat\beta_2x) \pm \frac{\sigma}{n^{1/2}}\frac{a^2}{h(v)} = (\hat\beta_1 + \hat\beta_2x) \pm \frac{\sigma a}{n^{1/2}}\left[1 + \{h'(v)\}^2\right]^{1/2}. \qquad (28.117)$$
Using (28.100), (28.117) becomes
$$(\hat\beta_1 + \hat\beta_2x) \pm (g_{1-\alpha})^{1/2}\left\{\frac{\sigma^2}{n} + \frac{\sigma^2x^2}{\sum x^2}\right\}^{1/2}, \qquad (28.118)$$
the terms in the braces being $\{\operatorname{var}\hat\beta_1 + x^2\operatorname{var}\hat\beta_2\}$.

If (28.118) is compared with the confidence limits (28.93) for $E(y \mid x^0)$ derived in
Example 28.4 (where we now put $\bar{x} = 0$, as we have done here), we see that apart
from the replacement of $s^2$ by $\sigma^2$, and of the $t_{1-\alpha/2}$ multiple by the $\chi$ multiple $(g_{1-\alpha})^{1/2}$,
the equations are of exactly the same form. Thus the confidence region (28.118) will
look exactly like the loci of the confidence limits (28.93) plotted in Fig. 28.2, being
a hyperbola with the fitted line as diameter. As might be expected, for given $\alpha$ the
branches of the hyperbola (28.118) are farther apart than those of (28.93), for we are
now setting a region for the whole line where previously we had loci of limits for a single
value on the line. For example, with $\alpha = 0.05$, $t_{1-\alpha/2}$ (with infinite degrees of freedom,
corresponding to $\sigma^2$ known) $= 1.96$, while $g_{1-\alpha}$ for a $\chi^2$ distribution with two degrees
of freedom $= 5.99$, the value 2.45 given for $a$ at the end of 28.29 being the square root
of this.
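The two multipliers quoted in this comparison can be reproduced without tables, since a $\chi^2$ variate with 2 degrees of freedom has distribution function $1 - \exp(-g/2)$. The following sketch uses only the Python standard library; the function name `chi2_2df_point` is an assumption made for the illustration.

```python
import math
from statistics import NormalDist

def chi2_2df_point(p):
    """100p per cent point of chi-square with 2 d.f.: solve 1 - exp(-g/2) = p."""
    return -2.0 * math.log(1.0 - p)

alpha = 0.05
g = chi2_2df_point(1.0 - alpha)                 # region: g_{1-alpha}
a = math.sqrt(g)                                # multiplier for the whole line
z = NormalDist().inv_cdf(1.0 - alpha / 2.0)     # multiplier for a single x0
print(round(g, 2), round(a, 2), round(z, 2))
```

The ratio `a / z` (about 1.25 at $\alpha = 0.05$) is the factor by which the region for the whole line is wider, at every $x$, than the loci of limits for a single expected value when $\sigma^2$ is known.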


28.31 If $\sigma^2$ is unknown, only slight modifications of the argument of 28.25-30
are required. Define the variable
$$w^2 = (n-2)s^2/\sigma^2, \qquad (28.119)$$
so that $w^2$ is the ratio of the sum of squared residuals from the fitted regression to the
true error variance, which (cf. 28.21) has a $\chi^2$ distribution with $n-2$ degrees of freedom.
From (28.84) and (28.119), we see that the statistics
$$u^* = (n-2)^{1/2}u/w = n^{1/2}(\beta_1 - \hat\beta_1)/s \quad\text{and}\quad v^* = (n-2)^{1/2}v/w = (\textstyle\sum x^2)^{1/2}(\beta_2 - \hat\beta_2)/s$$
each have a "Student's" distribution with $n-2$ degrees of freedom. If we now
re-trace the argument of 28.25-30 using $u^*$ and $v^*$ in place of $u$ and $v$, we find that
$g(u^{*2}, v^{*2})$ is distributed independently of the parameters $\beta_1$, $\beta_2$, $\sigma^2$. The solution
of the weighted area minimization problem of 28.28-9 now becomes too difficult in
practice, and we proceed directly to the classical solution given in 28.30.

Using Hotelling and Working's direct argument, we see that since, from 28.21,
$u^2$, $v^2$ and $w^2$ are distributed independently of one another as $\chi^2$ with 1, 1, and $(n-2)$
degrees of freedom respectively, the ratio
$$\frac{(u^2 + v^2)/2}{w^2/(n-2)} = \tfrac12(u^{*2} + v^{*2})$$
has a variance-ratio ($F$) distribution with 2 and $n-2$ degrees of freedom. Thus if we
replace $\sigma$ by $s$ in 28.30, and put $g_{1-\alpha}$ equal to twice the $100(1-\alpha)$ per cent point of this
$F$-distribution, we obtain the required confidence region from (28.118). As in 28.30,
we find that the boundaries of the region are always farther apart than the loci of the
confidence limits for $E(y \mid x^0)$.
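When the numerator has 2 degrees of freedom the percentage points of $F$ have a closed form, because $P(F \le f) = 1 - (1 + 2f/\nu)^{-\nu/2}$ for an $F$ variate with $(2, \nu)$ degrees of freedom. The sketch below uses this to compute the multiplier $(g_{1-\alpha})^{1/2}$ of 28.31 for several values of $\nu = n-2$; the function names are illustrative assumptions.

```python
import math

def f_point_2_nu(p, nu):
    """100p per cent point of F with (2, nu) d.f., from the closed form
    P(F <= f) = 1 - (1 + 2f/nu)^(-nu/2)."""
    return 0.5 * nu * ((1.0 - p) ** (-2.0 / nu) - 1.0)

def region_multiplier(alpha, nu):
    """(g_{1-alpha})^{1/2}, with g_{1-alpha} equal to twice the F point."""
    return math.sqrt(2.0 * f_point_2_nu(1.0 - alpha, nu))

for nu in (5, 10, 30, 1000):
    print(nu, round(region_multiplier(0.05, nu), 3))
```

As $\nu$ increases the multiplier falls towards the value for $\sigma^2$ known, the square root of the 95 per cent point of $\chi^2$ with 2 degrees of freedom (about 2.45).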


28.32 There is no difficulty in extending our results to the case of more than one
regressor, a sketch of such a generalization having been given by Hoel (1951). With
$k$ regressors we find, generalizing 28.31, that $\left(u^{*2} + \sum_{i=1}^{k}v_i^{*2}\right)\big/(k+1)$ has a
variance-ratio distribution with $(k+1, n-k-1)$ degrees of freedom.

Gafarian (1964) gives a method for obtaining confidence regions for a polynomial 
regression over any subset of its range, with detailed tables for a region of constant 
width in the straight line case. 


EXERCISES 


28.1 The bivariate distribution of $x$ and $y$ is uniform over the region in the $(x, y)$
plane bounded by the ellipse
$$ax^2 + 2hxy + by^2 = c.$$
Show that the regression of each variable on the other is linear and that the scedastic
curves are quadratic parabolas.


28.2 The bivariate distribution of x and y is uniform over the parallelogram bounded 
by the lines x = 3(y—1), x = 3(y+1), x = y+1, x = y—1. Show that the regression 
of y on x is linear, but that the regression of x on y consists of sections of three straight 
lines joined together. 


28.3 Show that if (28.59-60) holds, but $\mathbf{X}$ has elements which are non-linear
functions of $r$ further parameters $\gamma_1, \ldots, \gamma_r$, making $(k+r)$ in all, the regression model can be
augmented (cf. 19.13-16) to
$$\mathbf{y} = (\mathbf{X}, \mathbf{D})\binom{\boldsymbol\beta}{\mathbf{0}} + \boldsymbol\epsilon,$$
where $\mathbf{D}$ is any $(n \times r)$ matrix chosen so that $(\mathbf{X}, \mathbf{D})$ is of full rank $(k+r)$, and $\mathbf{0}$ is an $r \times 1$
vector of zeros. Hence obtain confidence regions for (a) the complete set of $(k+r)$
parameters, (b) the $r$ further parameters alone.


(Halperin (1963); cf. also Hartley (1964)) 


28.4 From (28.17), show that if the marginal distribution of a bivariate distribution
is of the Gram-Charlier form
$$f = \alpha(x)\{1 + a_3H_3 + a_4H_4 + \ldots\},$$
then the regression of $y$ on $x$ is
$$\mu'_{1x} = \frac{\sum_{r=0}^{\infty}\dfrac{\kappa_{r1}}{r!}H_r(x)}{1 + \sum_{r=3}^{\infty}a_rH_r(x)}.$$
(Wicksell, 1917)
28.5 $x_1$, $x_2$, $x_3$ are trivariate normally distributed. Use 28.9 to show that the
regression of each variate on the other two is linear.


28.6 Verify equation (28.33). 


28.7 If, for each fixed $y$, the conditional distribution of $x$ is normal, show that their
bivariate distribution must be of the form
$$f(x, y) = \exp\{-(a_1x^2 + a_2x + a_3)\}$$
where the $a_i$ are functions of $y$. Show that if, in addition, the equiprobable contours of
$f(x, y)$ are similar concentric ellipses, $f$ must be bivariate normal.
(Bhattacharyya, 1943) 


28.8 Show that if the regression of $x$ on $y$ is linear, if the conditional distribution of $x$
for each fixed $y$ is normal and homoscedastic and if the marginal distribution of $y$ is normal,
then $f(x, y)$ must be bivariate normal. (Bhattacharyya, 1943)


28.9 If the conditional distributions of $x$ for each fixed $y$, and of $y$ for each fixed $x$,
are normal, and one of these conditional distributions is homoscedastic, show that $f(x, y)$
is bivariate normal. (Bhattacharyya, 1943)


28.10 Show that if every non-degenerate linear function of $x$ and $y$ is normal, then
$f(x, y)$ is bivariate normal. (Bhattacharyya, 1943)


28.11 If the regressions of $x$ on $y$ and of $y$ on $x$ are both linear, and the conditional
distribution of each for every fixed value of the other is normal, show that $f(x, y)$ is either
bivariate normal or may (with a suitable choice of origin and scale) be written in the form
$$f = \exp\{-(x^2 + a^2)(y^2 + b^2)\}.$$
(Bhattacharyya, 1943)


28.12 Show that for interval estimation of $\beta$ in the linear regression model
$y_i = \beta x_i + \epsilon_i$, the interval based on the "Student's" variate
$$t = (b - \beta)\big/(s^2/\textstyle\sum x^2)^{1/2}$$
is physically shorter for every sample than that based on
$$u = (\bar{y} - \beta\bar{x})\big/(s^2/n)^{1/2}.$$


28.13 In setting confidence regions for a regression line in 28.28, show that if the
weight function used is
$$w(x) = \frac{1}{2S}\left(1 + \frac{x^2}{S^2}\right)^{-3/2}$$
instead of (28.106), the Working-Hotelling solution of 28.30 is strictly optimum
(area-minimizing) in the family of ellipses (28.112). (Hoel, 1951)


28.14 Show that if there are two different vectors $\mathbf{y}_1$, $\mathbf{y}_2$ each related to the same
set of regressors $\mathbf{x}$ in a linear model, the difference between any pair of corresponding
parameters in the models may be tested by applying the method of 28.21 to the
differences $(y_{1i} - y_{2i})$.
(Yates (1939b) also considers the case where the regressors
are different and the $y$-vectors correlated.)


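28.15 Independent samples of sizes $n_1$, $n_2$ are taken from two regression models
$$y = \alpha_i + \beta_ix + \epsilon, \quad i = 1, 2,$$
with independently normally distributed errors. The error variance $\sigma^2$ is the same in
both models. If $b_1$, $b_2$ are the separate Least Squares estimators of $\beta_1$, $\beta_2$, show that
$(b_1 - b_2)$ is normally distributed with mean $(\beta_1 - \beta_2)$ and variance
$$\sigma^2\left\{\frac{1}{\sum_j(x_{1j}-\bar{x}_1)^2} + \frac{1}{\sum_j(x_{2j}-\bar{x}_2)^2}\right\},$$
where $x_{ij}$ denotes the $j$th value of the regressor in the $i$th sample, and that
$$t = \{(b_1 - b_2) - (\beta_1 - \beta_2)\}\Big/\left[s^2\left\{\frac{1}{\sum_j(x_{1j}-\bar{x}_1)^2} + \frac{1}{\sum_j(x_{2j}-\bar{x}_2)^2}\right\}\right]^{1/2}$$
has a "Student's" $t$-distribution with $n_1 + n_2 - 4$ degrees of freedom, where
$$s^2 = \frac{(n_1-2)s_1^2 + (n_2-2)s_2^2}{n_1 + n_2 - 4},$$
and $s_1^2$, $s_2^2$ are the separate estimators of $\sigma^2$ in the two models. Hence show that $t$ may
be used to test the hypothesis that $\beta_1 = \beta_2$ against $\beta_1 \neq \beta_2$. (cf. Fisher, 1922b)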


28.16 For the simple linear model $y = \beta_0 + \beta_1x + \epsilon$, two independent samples,
of sizes $m$ and $n$, have means $(\bar{y}_m, \bar{x}_m)$ and $(\bar{y}_n, \bar{x}_n)$. Show that $b_1 = (\bar{y}_m - \bar{y}_n)/(\bar{x}_m - \bar{x}_n)$
is an unbiassed estimator of $\beta_1$, with variance $\sigma^2\left(\dfrac{1}{m} + \dfrac{1}{n}\right)\Big/(\bar{x}_m - \bar{x}_n)^2$. Show that $b_1$ is
not consistent (as $m, n \to \infty$ with $m/n$ fixed) if the two samples were formed by random
subdivision of an original sample of $(m+n)$ observations.


28.17 We are given $n$ observations on the model
$$y = \beta_1x_1 + \beta_2x_2 + \epsilon$$
with error variance $\sigma^2$, and, in addition, an extraneous unbiassed estimator $b_1$ of $\beta_1$
together with an unbiassed estimator $s_1^2$ of its sampling variance $\sigma_1^2$. To estimate $\beta_2$,
consider the regression of $(y - b_1x_1)$ on $x_2$. Show that the estimator
$$b_2 = \sum(y - b_1x_1)x_2\big/\sum x_2^2$$
is unbiassed, with variance
$$\operatorname{var}b_2 = (\sigma^2 + \sigma_1^2r^2\textstyle\sum x_1^2)\big/\sum x_2^2,$$
where $r$ is the observed correlation between $x_1$ and $x_2$. If $b_1$ is ignored, show that the
ordinary Least Squares estimator of $\beta_2$ has variance $\sigma^2/\{\sum x_2^2(1-r^2)\}$ and hence that
the use of the extraneous information about $\beta_1$ increases efficiency in estimating $\beta_2$ if
and only if
$$\sigma_1^2 < \frac{\sigma^2}{\sum x_1^2(1-r^2)},$$
i.e. if the variance of $b_1$ is less than that of the ordinary Least Squares estimator of $\beta_1$.
Show that an unbiassed estimator of $\operatorname{var}b_2$ is given by
$$\hat{V} = \frac{1}{(n-2)\sum x_2^2}\left[\sum(y - b_1x_1 - b_2x_2)^2 + s_1^2\sum x_1^2\{(n-1)r^2 - 1\}\right],$$
but that if the errors are normally distributed this is not distributed as a multiple of
a $\chi^2$ variate. (Durbin, 1953)


28.18 In generalization of the situation of Exercise 28.17, let $\mathbf{b}_1$ be a vector of
unbiassed estimators of the $h$ parameters $(\beta_1, \beta_2, \ldots, \beta_h)$, with dispersion matrix $\mathbf{V}_1$; and
let $\mathbf{b}_2$ be an independently distributed vector of unbiassed estimators of the $k$ $(> h)$
parameters $(\beta_1, \beta_2, \ldots, \beta_h, \beta_{h+1}, \ldots, \beta_k)$, with dispersion matrix $\mathbf{V}_2$. Using Aitken's
generalization of Gauss's Least Squares Theorem (19.17), show that the minimum
variance unbiassed estimators of $(\beta_1, \ldots, \beta_k)$ which are linear in the elements of $\mathbf{b}_1$ and
$\mathbf{b}_2$ are the components of the vector
$$\mathbf{b} = \{(\mathbf{V}_1^{-1})^* + \mathbf{V}_2^{-1}\}^{-1}\{(\mathbf{V}_1^{-1})^*\mathbf{b}_1^* + \mathbf{V}_2^{-1}\mathbf{b}_2\},$$
with dispersion matrix
$$\mathbf{V}(\mathbf{b}) = \{(\mathbf{V}_1^{-1})^* + \mathbf{V}_2^{-1}\}^{-1},$$
where an asterisk denotes the conversion of an $(h \times 1)$ vector into a $(k \times 1)$ vector or an
$(h \times h)$ matrix into a $(k \times k)$ matrix by putting it into the leading position and augmenting
it with zeros.
Show that $\mathbf{V}(\mathbf{b})$ reduces, in the particular case $h = 1$, to
$$\sigma^2\begin{pmatrix}\sum x_1^2 + \sigma^2/\sigma_1^2 & \sum x_1x_2 & \cdots & \sum x_1x_k\\ \sum x_1x_2 & \sum x_2^2 & \cdots & \sum x_2x_k\\ \vdots & \vdots & & \vdots\\ \sum x_1x_k & \sum x_2x_k & \cdots & \sum x_k^2\end{pmatrix}^{-1},$$
differing only in its leading term from the usual Least Squares dispersion matrix
$\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$. (Durbin, 1953)


28.19 A simple graphical procedure may be used to fit an ordinary Least Squares
regression of $y$ on $x$ without computations when the $x$-values are equally spaced, say at
intervals of $s$. Let the $n$ observed points on the scatter diagram of $(y, x)$ be $P_1, P_2, \ldots, P_n$
in increasing order of $x$. Find the point $Q_2$ on $P_1P_2$ with $x$-coordinate $\frac23s$ above that
of $P_1$; find $Q_3$ on $Q_2P_3$ with $x$-coordinate $\frac23s$ above that of $Q_2$; and so on by equal steps,
joining each $Q$-point to the next $P$-point and finding the next $Q$-point $\frac23s$ above, until
finally $Q_{n-1}P_n$ gives the last point, $Q_n$. Carry out the same procedure backwards, starting
from $P_nP_{n-1}$ and determining $Q'_{n-1}$, say, $\frac23s$ below $P_n$ in $x$-coordinate, and so on until
$Q'_1$ on $Q'_2P_1$ is reached, $\frac23s$ below $Q'_2$. Then $Q_nQ'_1$ is the Least Squares line. Prove
this. (Askovitz, 1957)
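The construction can be tried numerically before attempting the proof. The sketch below reads each step of the construction as two-thirds of the spacing $s$; it carries out the forward and backward passes by linear interpolation and compares the line through the two final points with the Least Squares line. The function names and the data are invented for the illustration.

```python
def askovitz_points(xs, ys):
    """Forward and backward passes of the graphical construction for equally
    spaced, increasing xs; returns the two final Q-points."""
    step = 2.0 * (xs[1] - xs[0]) / 3.0

    def run(px_list, py_list, direction):
        qx, qy = px_list[0], py_list[0]            # start at the first P-point
        for px, py in zip(px_list[1:], py_list[1:]):
            new_x = qx + direction * step          # move two-thirds of a spacing
            t = (new_x - qx) / (px - qx)           # position along the join to P
            qy = qy + t * (py - qy)
            qx = new_x
        return qx, qy

    forward = run(xs, ys, +1)
    backward = run(xs[::-1], ys[::-1], -1)
    return forward, backward

def least_squares(xs, ys):
    """Ordinary Least Squares intercept and slope."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    return ybar - slope * xbar, slope

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.7, 2.9, 5.2, 6.1]
(fx, fy), (bx, by) = askovitz_points(xs, ys)
graph_slope = (fy - by) / (fx - bx)
graph_intercept = fy - graph_slope * fx
print(graph_intercept, graph_slope, least_squares(xs, ys))
```

In this reading the forward pass ends at the mean of the points weighted in proportion to $1, 2, \ldots, n$, which suggests one route to the proof: the Least Squares residuals sum to zero against any weights that are linear in $x$.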


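28.20 A matrix $\mathbf{A}$ of sums of squares and cross-products of $n$ observations on $p$
variables is inverted. A vector $\mathbf{x}$ containing one further observation on each variable
becomes available. Show that the inverse of $\mathbf{B} = \mathbf{A} + \mathbf{x}\mathbf{x}'$ is
$$\mathbf{B}^{-1} = \mathbf{A}^{-1} - (\mathbf{A}^{-1}\mathbf{x}\mathbf{x}'\mathbf{A}^{-1})/(1 + \mathbf{x}'\mathbf{A}^{-1}\mathbf{x}).$$
Hence show that a quadratic form $\mathbf{x}'\mathbf{A}^{-1}\mathbf{x}$ may be evaluated by
$$1 + \mathbf{x}'\mathbf{A}^{-1}\mathbf{x} = |\mathbf{A} + \mathbf{x}\mathbf{x}'|/|\mathbf{A}|.$$
(Cf. Bartlett, 1951)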
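Both identities of Exercise 28.20 are easily checked on a small case before proving them. The sketch below does so for $p = 2$ with an invented symmetric matrix $\mathbf{A}$ and vector $\mathbf{x}$; the helper names `det2` and `inv2` are assumptions for the illustration.

```python
def det2(m):
    """Determinant of a 2 x 2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]

A = [[4.0, 1.0], [1.0, 3.0]]   # invented sums of squares and cross-products
x = [1.0, 2.0]                 # one further observation on each variable
B = [[A[i][j] + x[i] * x[j] for j in range(2)] for i in range(2)]

A_inv = inv2(A)
A_inv_x = [sum(A_inv[i][j] * x[j] for j in range(2)) for i in range(2)]
quad = sum(x[i] * A_inv_x[i] for i in range(2))   # x' A^{-1} x

# A is symmetric, so x'A^{-1} is the transpose of A^{-1}x.
B_inv_updated = [[A_inv[i][j] - A_inv_x[i] * A_inv_x[j] / (1.0 + quad)
                  for j in range(2)] for i in range(2)]
B_inv_direct = inv2(B)
print(B_inv_updated, B_inv_direct, 1.0 + quad, det2(B) / det2(A))
```

The practical point of the first identity is that the inverse need not be recomputed from scratch when one further observation arrives.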
28.21 In the regression model
$$y_i = \alpha + \beta x_i + \epsilon_i, \quad i = 1, 2, \ldots, n,$$
suppose that the observed mean $\bar{x} = 0$ and let $x_0$ satisfy $\alpha + \beta x_0 = 0$. Use the random
variable $\hat\alpha + \hat\beta x_0$ to set up a confidence statement for a quadratic function, of form
$$P\{Q(x_0) \geq 0\} = 1-\alpha.$$
Hence derive a confidence statement for $x_0$ itself, and show that, depending on the
coefficients in the quadratic function, this may place $x_0$:

(i) in a finite interval;
(ii) outside a finite interval;
(iii) in the infinite interval consisting of the whole real line.

(cf. Lehmann, 1959)
28.22 To determine which of the models
$$y = \beta'_0 + \beta'_1x_1 + \epsilon', \qquad y = \beta''_0 + \beta''_2x_2 + \epsilon'',$$
is more effective in predicting $y$, consider the model
$$y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \epsilon_i, \quad i = 1, 2, \ldots, n,$$
with independent normal errors of variance $\sigma^2$, estimated by $s^2$ with $(n-3)$ degrees of
freedom. Show that the statistics
$$z_s = \sum_i(y_i - \bar{y})(x_{si} - \bar{x}_s)\Big/\left\{\sum_i(x_{si} - \bar{x}_s)^2\right\}^{1/2}, \quad s = 1, 2,$$
have $\operatorname{var}z_1 = \operatorname{var}z_2 = \sigma^2$, $\operatorname{cov}(z_1, z_2) = \sigma^2r_{12}$,
where $r_{12}$ is the observed correlation between $x_1$ and $x_2$. Hence show that $(z_1 - z_2)$ is
exactly normally distributed with mean $\beta'_1\{\sum(x_{1i} - \bar{x}_1)^2\}^{1/2} - \beta''_2\{\sum(x_{2i} - \bar{x}_2)^2\}^{1/2}$ and variance
$2\sigma^2(1 - r_{12})$. Using the fact that $\sum_i(y_i - \bar{y})^2 - (\hat\beta'_1)^2\sum_i(x_{1i} - \bar{x}_1)^2$ is the sum of squares of
deviations from the regression of $y$ on $x_1$ alone, show that the hypothesis of equality of these
two sums of squares may be tested by the statistic $t = (z_1 - z_2)/\{2s^2(1 - r_{12})\}^{1/2}$,
distributed in "Student's" form with $(n-3)$ degrees of freedom.
(Hotelling (1940); Healy (1955). The test is generalized to the comparison
of more than two predictors of $y$ by E. J. Williams (1959).)


28.23 By consideration of the case when $y_j = x_j^k$, $j = 1, 2, \ldots, n$, exactly, show
that if the orthogonal polynomials defined at (28.73) and (28.74) are orthonormal (i.e.
$\sum_{j=1}^{n}\phi_i^2(x_j) = 1$, all $i$) then they satisfy the recurrence relation
$$\phi_k(x_j) = \frac{1}{b_k}\left\{x_j^k - \sum_{i=0}^{k-1}\phi_i(x_j)\sum_{l=1}^{n}x_l^k\phi_i(x_l)\right\},$$
where the normalizing constant $b_k$ is defined by
$$b_k^2 = \sum_{j=1}^{n}\left\{x_j^k - \sum_{i=0}^{k-1}\phi_i(x_j)\sum_{l=1}^{n}x_l^k\phi_i(x_l)\right\}^2.$$
Hence verify (28.80) and (28.81), with appropriate adjustments. (Robson, 1959)


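28.24 In the linear model $\mathbf{y} = \mathbf{X}_1\boldsymbol\beta_1 + \mathbf{X}_2\boldsymbol\beta_2 + \boldsymbol\epsilon$, show that the LS estimators may
be written as
$$\hat{\boldsymbol\beta}_1 = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'(\mathbf{y} - \mathbf{X}_2\hat{\boldsymbol\beta}_2), \qquad \hat{\boldsymbol\beta}_2 = (\mathbf{X}_2'\mathbf{D}\mathbf{X}_2)^{-1}\mathbf{X}_2'\mathbf{D}\mathbf{y},$$
where $\mathbf{D} = \mathbf{I} - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'$.
If $\boldsymbol\beta_1$ is first estimated from
$$\mathbf{y} = \mathbf{X}_1\boldsymbol\beta_1 + \boldsymbol\epsilon^* \qquad (A)$$
and $\boldsymbol\beta_2$ is then estimated, using the residuals $\mathbf{y}_r$ in (A) as though they were uncorrelated,
from
$$\mathbf{y}_r = \mathbf{X}_2\boldsymbol\beta_2 + \boldsymbol\epsilon,$$
show that the estimators obtained are
$$\boldsymbol\beta_1^* = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y}, \qquad \boldsymbol\beta_2^* = (\mathbf{X}_2'\mathbf{X}_2)^{-1}\mathbf{X}_2'\mathbf{D}\mathbf{y},$$
and that $\boldsymbol\beta_1^*$ and $\boldsymbol\beta_2^*$ are biassed unless $\mathbf{X}_1'\mathbf{X}_2 = \mathbf{0}$ or $\boldsymbol\beta_2 = \mathbf{0}$. If $\beta_2$ is a scalar parameter,
show that
$$\beta_2^* = (1 - R^2)\hat\beta_2,$$
where $R$ is the multiple correlation coefficient of the single variable $x_2$ upon all the variables
in $\mathbf{X}_1$.

In the case $y_j = \beta_1x_{1j} + \beta_2x_{2j} + \epsilon_j$, show that the mean-square-errors of the biassed
two-stage estimators $\beta_1^*$, $\beta_2^*$ are less than the variances of the unbiassed LS estimators
$\hat\beta_1$, $\hat\beta_2$ if $\beta_2^2/V(\hat\beta_2) < 1$.

(Cf. Freund et al. (1961), Goldberger and Jochems (1961),
Goldberger (1961), Zyskind (1963) and T. D. Wallace (1964))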


CHAPTER 29 
FUNCTIONAL AND STRUCTURAL RELATIONSHIP 


Functional relations between mathematical variables 


29.1 It is common in the natural sciences, and to some extent in the social sciences,
to set up a model of a system in which certain mathematical (not random) variables
are functionally related. A well-known example is Boyle's law, which states that, at
constant temperature, the pressure ($P$) and the volume ($V$) of a given quantity of gas
are related by the equation

$$PV = \text{constant}. \qquad (29.1)$$

(29.1) may not hold near the liquefaction point of the gas, or possibly in other parts
of the range of $P$ and $V$. If we wish to discuss the pressure-volume relationship in
the so-called adiabatic expansion, when internal heat does not have time to adjust itself
to surrounding conditions, we may have to modify (29.1) to

$$PV^{\gamma} = \text{constant}, \qquad (29.2)$$
where $\gamma$ is an additional constant which may have to be estimated. Moreover, at some
stage we may wish to take temperature ($T$) into account and extend (29.1) to the form

$$PVT^{-1} = \text{constant}.$$

In general, we have a set of variables $X_1, \ldots, X_k$ related in $p$ functional forms
$$f_j(X_1, \ldots, X_k; \alpha_1, \ldots, \alpha_l) = 0, \quad j = 1, 2, \ldots, p, \qquad (29.3)$$
depending on $l$ parameters $\alpha_r$, $r = 1, 2, \ldots, l$. Our object is usually to estimate
the $\alpha_r$ from a set of observations, and possibly also to determine the actual functional
forms $f_j$, especially in cases where neither theoretical considerations nor previous
experience provide a complete specification of these forms. If we were able to observe
values of $X$ without error, there would be no statistical problem here at all: we should
simply have a set of values satisfying (29.3) and the problem would be merely the
mathematical one of solving the set of equations. However, experimental or observational
error usually affects our measurements. What we then observe is not a "true"
value $X$, but $X$ together with some random element. We thus have to estimate the
parameters $\alpha_r$ (and possibly the forms $f_j$) from data which are, to some extent at least,
composed of samples from frequency distributions of error. Our problem then
immediately becomes statistical.


29.2 In our view, it is particularly important in this subject, which has suffered
from confusion in the past, to use a clear terminology and notation. In this chapter,
we shall denote mathematical variables by capital Roman letters (actually italic). As
usual, we denote parameters by small Greek letters (here we shall particularly use $\alpha$
and $\beta$) and random variables generally by a small Roman letter or, in the case of
Maximum Likelihood estimators, by the parameter covered by a circumflex, e.g. $\hat\alpha$.
Error random variables will be symbolized by other small Greek letters, particularly
$\delta$ and $\epsilon$, and the observed random variables corresponding to unobservable variables
will be denoted by a "corresponding" (*) Greek letter, e.g., $\xi$ for $X$. The only possible
source of confusion in this system of notation is that Greek letters are performing
three roles (parameters, error variables, observable variables) but distinct groups of
letters are used throughout, and there is a simple way of expressing our notation which
may serve as a rescuer: any Greek letter "corresponding" to a capital Roman letter
is the observable random variable emanating from that mathematical variable; all other
Greek letters are unobservables, being either parameters or error variables.


29.3 We begin with the simplest case. Two mathematical variables $X$ and $Y$
are known to be linearly related, so that we have
$$Y = \alpha_0 + \alpha_1X, \qquad (29.4)$$
and we wish to estimate the parameters $\alpha_0$, $\alpha_1$. We are not able to observe $X$ and $Y$;
we observe only the values of two random variables $\xi$, $\eta$ defined by
$$\xi_i = X_i + \delta_i, \qquad \eta_i = Y_i + \epsilon_i. \qquad (29.5)$$
The suffixes in (29.5) are important. Observations about any "true" value are
distributed in a frequency distribution of an "error" random variable, and the form of this
distribution may depend on $i$. For example, errors may tend to be larger for large
values of $X$ than for small $X$, and this might be expressed by an increase in the variance
of the error variable $\delta$.

In this simplest case, however, we suppose the $\delta_i$ to be identically distributed, so
that $\delta_i$ has the same mean (taken to be zero without loss of generality) and variance
for all $X_i$; and thus also for $\epsilon$ and $Y$. We also suppose the errors $\delta$, $\epsilon$ to be uncorrelated
amongst themselves and with each other. For the present, we do not assume that $\delta$
and $\epsilon$ are normally distributed. Our model is thus (29.4) and (29.5) with
$$E(\delta_i) = E(\epsilon_i) = 0, \quad \operatorname{var}\delta_i = \sigma_\delta^2, \quad \operatorname{var}\epsilon_i = \sigma_\epsilon^2, \quad \text{all } i,$$
$$\operatorname{cov}(\delta_i, \delta_j) = \operatorname{cov}(\epsilon_i, \epsilon_j) = 0, \quad i \neq j, \qquad (29.6)$$
$$\operatorname{cov}(\delta_i, \epsilon_j) = 0, \quad \text{all } i, j.$$
The restrictive assumption on the means of the $\delta_i$ is only that they are all equal, and
similarly for the $\epsilon_i$; we may reduce their means $\mu_\delta$ and $\mu_\epsilon$ to zero by absorbing them
into $\alpha_0$, since we clearly could not distinguish $\alpha_0$ from these biases in any case.
In view of (29.6) we may on occasion unambiguously write the model as
$$\xi = X + \delta, \qquad \eta = Y + \epsilon. \qquad (29.7)$$


(*) It will be seen that the Roman-Greek "correspondence" is not so much strictly
alphabetical as aural and visual. In any case, it would be more logical to use the ordinary lower-case
Roman letter, i.e. the observed $x$ corresponding to the mathematical variable $X$, but there is
danger of confusion in suffixes, and besides, we need $x$ for another purpose (cf. 29.6).


29.4 At first sight, the estimation of the parameters in (29.4) looks like a problem
in regression analysis; and indeed, this resemblance has given rise to much confusion.
In a regression situation, however, we are concerned with the dependence of the mean
value of $\eta$ (which is $Y$) upon $X$, which is not subject to error; the error variable $\delta$
is identically zero in value, so that $\sigma_\delta^2 = 0$. Thus the regression situation is essentially
a special case of our present model. In addition (though this is a difference of
background, not of formal analysis), the variation of the dependent variable in a regression
analysis is not necessarily, or even usually, due to error alone. It may be wholly or
partly due to the inherent structure of the relationship between the variables. For
example, body weight varies with height in an intrinsic way, quite unconnected with
any errors of measurement.

We may easily convince ourselves that the existence of errors in both $X$ and $Y$
poses a problem quite distinct from that of regression. If we substitute for $X$ and $Y$
from (29.7) into (29.4), we obtain
$$\eta = \alpha_0 + \alpha_1\xi + (\epsilon - \alpha_1\delta). \qquad (29.8)$$
This is not a simple regression situation: $\xi$ is a random variable, and it is correlated
with the error term $(\epsilon - \alpha_1\delta)$. For, from (29.6) and (29.7),
$$\operatorname{cov}(\xi, \epsilon - \alpha_1\delta) = E\{\xi(\epsilon - \alpha_1\delta)\} = E\{(X + \delta)(\epsilon - \alpha_1\delta)\} = -\alpha_1\sigma_\delta^2, \qquad (29.9)$$
which is only zero if $\sigma_\delta^2 = 0$, which is the regression situation, or in the trivial case
$\alpha_1 = 0$.

The equation (29.8) is called a structural relation between the observable random
variables $\xi$, $\eta$. This structural relation is a result of the functional relation between
the mathematical variables $X$, $Y$.


29.5 In regression analysis, the values of the regressor variable X may be selected 
arbitrarily, e.g. at equal intervals along its effective range. But they may also emerge 
as the result of some random selection, i.e. n pairs of observations may be randomly
chosen from a bivariate distribution and the regression of one variable upon the other 
examined. (We have already discussed these alternative regression models in 26.24, 
27.29.) In our present model also, the values of X might appear as a result of some 
random process or as a result of deliberate measurement at particular points, but in 
either case X remains unobserved due to the errors of observation. We now discuss 
the situation where X, and hence Y, becomes a random variable, so that the functional 
relation (29.4) itself becomes a structural relation between the unobservables. 


Structural relations between random variables 


29.6 Suppose that X, Y are themselves random variables (in accordance with our 
conventions we shall therefore now write them as x, y) and that (29.4), (29.5) and 
(29.6) hold as before. (29.8) will once more follow, but (29.9) will no longer hold 
without further assumptions, for in it X was treated as a constant. The correct version 
of (29.9) is now : 


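$$\operatorname{cov}(\xi, \varepsilon - \alpha_1\delta) = E\{(x+\delta)(\varepsilon - \alpha_1\delta)\} = E(x\varepsilon) - \alpha_1 E(x\delta) - \alpha_1\sigma_\delta^2, \tag{29.10}$$
and we now make the further assumptions (two for x and two for y)
$$\operatorname{cov}(x, \delta) = \operatorname{cov}(x, \varepsilon) = \operatorname{cov}(y, \delta) = \operatorname{cov}(y, \varepsilon) = 0. \tag{29.11}$$
(29.11) reduces (29.10) to (29.9) as before.

The present model is therefore
$$\xi_i = x_i + \delta_i, \qquad \eta_i = y_i + \varepsilon_i, \tag{29.12}$$
$$y_i = \alpha_0 + \alpha_1 x_i, \tag{29.13}$$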


subject to (29.6) and (29.11), leading to (29.8) as before. We have replaced the func- 
tional relation (29.4) between mathematical variables by the structural relation (29.13) 
expressing an exact linear relationship between two unobservable random variables 
$x$, $y$. The present model is a generalization of our previous one, which is simply the
case where $x_i$ degenerates to a constant, $X_i$. The relation (29.8) between the
observables $\xi$, $\eta$ is a structural one, as before, but we also have a structural relation at the
heart of the situation, so to speak. 

The applications of structural relation models are principally to the social sciences, 
especially econometrics. We shall revert to this subject in connexion with multivariate 
analysis in Volume 3. Here, we may briefly mention by way of illustration that if 
the quantity sold (y) of a commodity and its price (x) are each regarded as random 
variables, the hypothesis that they are linearly related is expressed by (29.13). If both 
price and quantity can only be observed with error, we have (29.12) and are therefore 
in the structural relation situation. The essential point is that there is both inherent
variability in each fundamental quantity with which we are concerned and observational
error in determining each.


29.7. One consequence of the distinctions we have been making has frequently 
puzzled scientists. The investigator who is looking for a unique linear relationship 
between variables cannot accept two different lines, but he was liable in the early days 
of the subject (and perhaps sometimes even today) to be presented with a pair of 
regression lines. Our discussion should have made it clear that a regression line does 
not purport to represent a functional relation between mathematical variables or a 
structural relation between random variables : it either exhibits a property of a bivariate 
distribution or, when the regressor variable is not subject to error, gives the relation 
between the mean of the dependent variable and the value of the regressor variable. 
The methods of this chapter, which our references will show to have been developed 
largely within the last twenty years, permit the mathematical model to be more precisely 
fitted to the needs of the scientific situation. 


29.8 It is interesting to consider how the approach from Least Squares regression
analysis breaks down when applied to the estimation of $\alpha_0$ and $\alpha_1$ in (29.8). If we
have $n$ pairs of observed values $(\xi_i, \eta_i)$, $i = 1, 2, \ldots, n$, we find on averaging (29.8)
over these values
$$\bar\eta = \alpha_0 + \alpha_1\bar\xi + \frac{1}{n}\sum_{i}(\varepsilon_i - \alpha_1\delta_i). \tag{29.14}$$
The last term on the right of (29.14) has a zero expectation, and we therefore have the
estimating equation
$$\bar\eta = \alpha_0 + \alpha_1\bar\xi, \tag{29.15}$$
which is unbiassed in the sense that both sides have the same expectation. If we
measure from the sample means $\bar\xi$, $\bar\eta$, we therefore have, as an estimator of $\alpha_0$,
$$a_0 = 0. \tag{29.16}$$
Similarly, multiplying (29.8) by $\xi$, we have on averaging
$$\frac{1}{n}\sum\xi\eta = a_1\,\frac{1}{n}\sum\xi^2 + \frac{1}{n}\sum\xi(\varepsilon - \alpha_1\delta), \tag{29.17}$$
where $a_1$ is the estimator of $\alpha_1$. The last term on the right of (29.17) does not vanish,
even as $n \to \infty$, for it tends to $\operatorname{cov}\{\xi, \varepsilon - \alpha_1\delta\}$, a multiple of $\sigma_\delta^2$ by (29.9). It seems,
then, that we require knowledge of $\sigma_\delta^2$ before we can estimate $\alpha_1$, by this method at
least. Indeed, we shall find that the error variances play an essential role in the
estimation of $\alpha_1$.
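
The breakdown described in 29.8 can be seen numerically. The following is a minimal simulation sketch, not part of the original treatment: the function names, the parameter values and the seed are illustrative assumptions. It draws structural data obeying (29.12)-(29.13) and shows that the Least Squares coefficient of $\eta$ on $\xi$ settles near $\alpha_1\sigma_x^2/(\sigma_x^2+\sigma_\delta^2)$, which is what the moments of the observables imply, rather than near $\alpha_1$.

```python
import random


def ls_slope(xi, eta):
    """Least Squares regression coefficient of eta on xi."""
    n = len(xi)
    mean_xi = sum(xi) / n
    mean_eta = sum(eta) / n
    s_xy = sum((a - mean_xi) * (b - mean_eta) for a, b in zip(xi, eta))
    s_xx = sum((a - mean_xi) ** 2 for a in xi)
    return s_xy / s_xx


def simulate_ls_slope(n=20000, alpha0=1.0, alpha1=2.0,
                      sd_x=1.0, sd_delta=1.0, sd_eps=0.5, seed=1):
    """Simulate (29.12)-(29.13) with normal x, delta, eps (assumed values)."""
    rng = random.Random(seed)
    xi, eta = [], []
    for _ in range(n):
        x = rng.gauss(0.0, sd_x)                  # unobservable true value
        y = alpha0 + alpha1 * x                   # exact linear relation
        xi.append(x + rng.gauss(0.0, sd_delta))   # observed xi = x + delta
        eta.append(y + rng.gauss(0.0, sd_eps))    # observed eta = y + eps
    return ls_slope(xi, eta)


slope = simulate_ls_slope()
# Value implied by the moments: alpha1 * var_x / (var_x + var_delta) = 1.0
limit = 2.0 * 1.0 / (1.0 + 1.0)
print(round(slope, 3), limit)
```

With these assumed values the true slope is 2, but the fitted coefficient stays close to 1 however large the sample is made.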


ML estimation of structural relationship 

29.9 If we are prepared to make the further assumption that the pairs of observables
$\xi_i$, $\eta_i$ are jointly normally and identically distributed, we may use the Maximum
Likelihood method to estimate the parameters of the structural relationship model
specified by (29.6) and (29.11)-(29.13). (This joint normality would follow from the
$x_i$ being identically normally distributed, and similarly for the $y_i$, $\delta_i$ and $\varepsilon_i$; if $x$, $y$
degenerate to constants $X$, $Y$, bivariate normality of $\delta$, $\varepsilon$ would be sufficient for the
joint normality of $\xi$, $\eta$.) We then have, by (29.6) and (29.11)-(29.13), the moments
$$\begin{aligned}
E(\xi) &= E(x) = \mu,\\
E(\eta) &= E(y) = \alpha_0 + \alpha_1\mu,\\
\operatorname{var}\xi &= \operatorname{var}x + \sigma_\delta^2 = \sigma_x^2 + \sigma_\delta^2,\\
\operatorname{var}\eta &= \operatorname{var}y + \sigma_\varepsilon^2 = \alpha_1^2\sigma_x^2 + \sigma_\varepsilon^2,\\
\operatorname{cov}(\xi,\eta) &= \operatorname{cov}(x,y) = \alpha_1\sigma_x^2.
\end{aligned} \tag{29.18}$$

It should be particularly noted that in (29.18) all the structural variables $x_i$ have the
same mean, and hence all the $y_i$ have the same mean. This is of importance in the
ML process, as we shall see, and it also means that the results which we are about to
obtain for structural relations are only of trivial value in the functional relation case,
since they will apply only to the case where $X_i$ (the constant to which $x_i$ degenerates
when $\sigma_x^2 = 0$) takes the same value ($\mu$) for all $i$. See 29.13 below.


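29.10 From (16.47) and Examples 18.14-15, the set of sample means, variances
and covariance are sufficient statistics for the five parameters of a bivariate normal
distribution, and are also the ML estimators of these parameters. Thus if $s_\xi^2$, $s_\eta^2$ (both
> 0) are the sample variances and $s_{\xi\eta}$ the sample covariance, the solutions of the equations
$$\begin{aligned}
\text{(a)}\quad & \hat\mu = \bar\xi,\\
\text{(b)}\quad & \hat\alpha_0 + \hat\alpha_1\hat\mu = \bar\eta,\\
\text{(c)}\quad & \hat\sigma_x^2 + \hat\sigma_\delta^2 = s_\xi^2,\\
\text{(d)}\quad & \hat\alpha_1^2\hat\sigma_x^2 + \hat\sigma_\varepsilon^2 = s_\eta^2,\\
\text{(e)}\quad & \hat\alpha_1\hat\sigma_x^2 = s_{\xi\eta},
\end{aligned} \tag{29.19}$$
for the unknowns among the six parameters $\mu$, $\alpha_0$, $\alpha_1$, $\sigma_x^2$, $\sigma_\delta^2$ and $\sigma_\varepsilon^2$ will be the ML
estimators of these parameters also, provided that these solutions yield admissible
values for all of them. Since $\mu$, $\alpha_0$ and $\alpha_1$ are unrestricted a priori, we need only ensure
that the solutions for $\sigma_x^2$, $\sigma_\delta^2$ and $\sigma_\varepsilon^2$ are non-negative. From (29.19)(c)-(e), these give
the restrictions:
$$\begin{aligned}
\text{For } \hat\sigma_\delta^2 \ge 0:\quad & \text{(i)}\ \ s_\xi^2 \ge s_{\xi\eta}/\hat\alpha_1,\\
\text{For } \hat\sigma_\varepsilon^2 \ge 0:\quad & \text{(ii)}\ \ s_\eta^2 \ge \hat\alpha_1 s_{\xi\eta},\\
\text{For } \hat\sigma_x^2 \ge 0:\quad & \text{(iii)}\ \ s_\xi^2 - \hat\sigma_\delta^2 \ge 0,\\
& \text{(iv)}\ \ s_\eta^2 - \hat\sigma_\varepsilon^2 \ge 0,\\
& \text{(v)}\ \ \text{if } \hat\sigma_x^2 > 0,\ \hat\alpha_1 \text{ has the sign of } s_{\xi\eta},\\
& \text{(vi)}\ \ \text{if } \hat\sigma_x^2 = 0,\ \hat\alpha_1 \text{ is indeterminate.}
\end{aligned} \tag{29.20}$$
If the restrictions (29.20) are not satisfied, the solutions of (29.19) are not the ML
estimators for our problem; they must instead be obtained by direct maximization of
the LF. (29.20)(vi) will remain true in that case, as the moments (29.18) show.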

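(29.19)(c)-(e) give the equalities
$$\hat\alpha_1(s_\xi^2 - \hat\sigma_\delta^2) = s_{\xi\eta}, \qquad s_\eta^2 - \hat\sigma_\varepsilon^2 = \hat\alpha_1 s_{\xi\eta}, \tag{29.21}$$
and making the coefficients of $\hat\alpha_1$ equal in these, we find
$$\hat\alpha_1 s_{\xi\eta}(s_\xi^2 - \hat\sigma_\delta^2) = s_{\xi\eta}^2 = (s_\xi^2 - \hat\sigma_\delta^2)(s_\eta^2 - \hat\sigma_\varepsilon^2). \tag{29.22}$$
(29.21) implies that
$$\frac{|s_{\xi\eta}|}{s_\xi^2} \le |\hat\alpha_1| \le \frac{s_\eta^2}{|s_{\xi\eta}|}, \tag{29.23}$$
so that a ML estimate of the slope of the structural line obtained from (29.19) is bounded
in absolute value by the LS regression coefficient of $\eta$ on $\xi$ and by the reciprocal of the
LS regression coefficient of $\xi$ on $\eta$. (29.19)(a)-(b) then implies that the estimated
structural line will lie between the two estimated regression lines, as is intuitively
reasonable.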


29.11 Whether the ML estimation is accomplished through (29.19) or by direct
maximization of the LF, (29.19)(a)-(b) will always give the ML estimators $\hat\mu$ and $\hat\alpha_0$
once $\hat\alpha_1$ is determined. When (29.19) is used, equations (c)-(e) must be solved for $\hat\alpha_1$,
but we cannot do this without some further assumptions, since there are four unknowns
in these three equations.

The reason for this difficulty is not far to seek. Looking back at (29.18), we see
that a change in the true value of $\alpha_1$ need not change the values of the five moments
given there. For example, suppose $\mu$ and $\alpha_1$ are positive; then any increase in the
value of $\alpha_1$ may be offset (a) in $E(\eta)$ by a reduction in $\alpha_0$, (b) in $\operatorname{cov}(\xi, \eta)$ by a reduction
in $\sigma_x^2$, and (c) in $\operatorname{var}\eta$ by an appropriate adjustment of $\sigma_\varepsilon^2$. (The reader will, perhaps,
like to try a numerical example.) What this means is that $\alpha_1$ is intrinsically impossible
to estimate, however large the sample; it is said to be unidentifiable. In fact, $\mu$ alone
of the six parameters is identifiable. We do not wish to assume knowledge of $\alpha_0$ and
$\alpha_1$, whose estimation is our primary objective, or of $\sigma_x^2$, since $x$ is unobservable. $\mu$ is
already identifiable, so we cannot improve matters there. Clearly, we must make an
assumption about the error variances.
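
A numerical example of the kind suggested above can be sketched as follows. The helper name `moments` and the two parameter sets are illustrative choices, not taken from the text; the point is only that two quite different values of $\alpha_1$ reproduce the same five moments (29.18), the reduction in $\sigma_x^2$ being balanced in $\operatorname{var}\xi$ by a larger $\sigma_\delta^2$.

```python
def moments(mu, alpha0, alpha1, var_x, var_delta, var_eps):
    """Return the five moments (29.18) of the observables (xi, eta)."""
    return (
        mu,                              # E(xi)
        alpha0 + alpha1 * mu,            # E(eta)
        var_x + var_delta,               # var xi
        alpha1 ** 2 * var_x + var_eps,   # var eta
        alpha1 * var_x,                  # cov(xi, eta)
    )


# Two hypothetical parameter sets with different slopes alpha1 = 1 and alpha1 = 2.
first = moments(mu=1.0, alpha0=1.0, alpha1=1.0, var_x=2.0, var_delta=1.0, var_eps=3.0)
second = moments(mu=1.0, alpha0=0.0, alpha1=2.0, var_x=1.0, var_delta=2.0, var_eps=1.0)

print(first)
print(second)
```

Since a bivariate normal distribution is determined by these five moments, no sample from it, however large, can distinguish the two parameter sets.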
29.12 Case 1: $\sigma_\delta^2$ known
(29.21) gives at once
$$\hat\alpha_1 = s_{\xi\eta}/(s_\xi^2 - \sigma_\delta^2) \tag{29.24}$$
if $s_\xi^2 > \sigma_\delta^2$, which ensures that (29.20)(iii)-(v) hold, but (29.20)(ii) must be imposed as the
condition $s_\eta^2 > s_{\xi\eta}^2/(s_\xi^2 - \sigma_\delta^2)$ for all the restrictions in (29.20) to be satisfied and (29.24)
to be the ML estimator. If these conditions are not satisfied, (29.19) does not give the
ML estimator, which is (cf. Exercise 29.18)
$$\hat\alpha_1 = s_\eta^2/s_{\xi\eta}. \tag{29.25}$$
Note that as $\sigma_\delta^2 \to 0$, $|\hat\alpha_1|$ in (29.24) tends to its lower bound in (29.23), while (29.25)
is its upper bound there.


Case 2: $\sigma_\varepsilon^2$ known
(29.21) gives
$$\hat\alpha_1 = (s_\eta^2 - \sigma_\varepsilon^2)/s_{\xi\eta}, \qquad s_{\xi\eta} \ne 0, \tag{29.26}$$
and, as in Case 1 above, the conditions $s_\eta^2 > \sigma_\varepsilon^2$, $s_\xi^2 > s_{\xi\eta}^2/(s_\eta^2 - \sigma_\varepsilon^2)$ ensure that all the
restrictions in (29.20) are satisfied and (29.26) is the ML estimator. Failing these
conditions, the ML estimator is the analogue of (29.25),
$$\hat\alpha_1 = s_{\xi\eta}/s_\xi^2.$$
The last sentence of Case 1 applies here, too, with obvious modifications.


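Case 3: $\sigma_\varepsilon^2/\sigma_\delta^2$ known
This is the classical method of resolving the identifiability problem. Putting
$\sigma_\varepsilon^2 = \lambda\sigma_\delta^2$, elimination of $\sigma_\delta^2$ between the equations of (29.21) gives
$$\hat\alpha_1^2 s_{\xi\eta} + \hat\alpha_1(\lambda s_\xi^2 - s_\eta^2) - \lambda s_{\xi\eta} = 0. \tag{29.27}$$
Unless $s_{\xi\eta} = 0$ (in which case $\hat\alpha_1 = 0$ unless $s_\eta^2/s_\xi^2 = \lambda$, when $\hat\alpha_1$ is indeterminate and
$\hat\sigma_x^2 = 0$; see (29.20)(vi)) this quadratic has necessarily non-zero roots
$$\frac{(s_\eta^2 - \lambda s_\xi^2) \pm \{(s_\eta^2 - \lambda s_\xi^2)^2 + 4\lambda s_{\xi\eta}^2\}^{1/2}}{2s_{\xi\eta}} = \frac{N}{2s_{\xi\eta}}, \tag{29.28}$$
say. By (29.19)(e), $\hat\sigma_x^2 = s_{\xi\eta}/\hat\alpha_1 = 2s_{\xi\eta}^2/N$, so to satisfy $\hat\sigma_x^2 > 0$, $N$ must be non-negative
and therefore the positive square root must always be taken in it. Thus
$$\hat\alpha_1 = \frac{(s_\eta^2 - \lambda s_\xi^2) + \{(s_\eta^2 - \lambda s_\xi^2)^2 + 4\lambda s_{\xi\eta}^2\}^{1/2}}{2s_{\xi\eta}} = \frac{N_1}{2s_{\xi\eta}}, \tag{29.29}$$
say, and $\hat\sigma_x^2 > 0$. We need only check that (29.20)(i) or (ii) holds, since now
$\hat\sigma_\varepsilon^2 = \lambda\hat\sigma_\delta^2$. (29.20)(ii) requires that $2s_\eta^2 \ge N_1$. Replacing $s_{\xi\eta}^2$ by its upper bound
$s_\xi^2 s_\eta^2$ establishes this. (29.29) is therefore the ML estimator.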
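
The three slope estimators of Cases 1-3 are simple enough to sketch in code. The function names and the moment values below are hypothetical, chosen only for illustration (they are the population moments of a structural model with $\alpha_1 = 1$, $\sigma_x^2 = 2$, $\sigma_\delta^2 = 1$, $\sigma_\varepsilon^2 = 3$); the admissibility conditions of (29.20) are noted in comments rather than enforced.

```python
import math


def slope_known_var_delta(s_xx, s_xy, var_delta):
    """(29.24): slope when the error variance of xi is known (needs s_xx > var_delta)."""
    return s_xy / (s_xx - var_delta)


def slope_known_var_eps(s_yy, s_xy, var_eps):
    """(29.26): slope when the error variance of eta is known (needs s_xy != 0)."""
    return (s_yy - var_eps) / s_xy


def slope_known_ratio(s_xx, s_yy, s_xy, lam):
    """(29.29): slope when lam = var_eps / var_delta is known (needs s_xy != 0)."""
    d = s_yy - lam * s_xx
    return (d + math.sqrt(d * d + 4.0 * lam * s_xy * s_xy)) / (2.0 * s_xy)


# Hypothetical moments: var xi = 3, var eta = 5, cov = 2.
s_xx, s_yy, s_xy = 3.0, 5.0, 2.0
a_case1 = slope_known_var_delta(s_xx, s_xy, var_delta=1.0)
a_case2 = slope_known_var_eps(s_yy, s_xy, var_eps=3.0)
a_case3 = slope_known_ratio(s_xx, s_yy, s_xy, lam=3.0)

# Bounds (29.23): LS coefficient of eta on xi, and reciprocal of that of xi on eta.
lower, upper = abs(s_xy) / s_xx, s_yy / abs(s_xy)
print(a_case1, a_case2, a_case3, lower, upper)
```

With mutually consistent inputs the three formulae agree, and the common value lies between the two regression bounds of (29.23).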


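Case 4: $\sigma_\delta^2$ and $\sigma_\varepsilon^2$ both known

Only two unknowns ($\alpha_1$, $\sigma_x^2$) now remain in (29.19)(c)-(e), and we can deduce both
(29.24) and (29.26), which are inconsistent with each other. (29.19) therefore cannot
give the ML estimators in this case, and we must maximize the LF directly, following
Birch (1964a).

Using the moments in the last three equations of (29.18), we have (apart from an additive constant)
$$-\frac{2}{n}\log L = \log|V| + |V|^{-1}\{s_\xi^2(\alpha_1^2\sigma_x^2 + \sigma_\varepsilon^2) - 2s_{\xi\eta}\alpha_1\sigma_x^2 + s_\eta^2(\sigma_x^2 + \sigma_\delta^2)\} \tag{29.30}$$
where
$$|V| = (\sigma_x^2 + \sigma_\delta^2)(\alpha_1^2\sigma_x^2 + \sigma_\varepsilon^2) - (\alpha_1\sigma_x^2)^2 = \alpha_1^2\sigma_x^2\sigma_\delta^2 + \sigma_\varepsilon^2(\sigma_x^2 + \sigma_\delta^2).$$

We standardize the known constants $\sigma_\delta$ and $\sigma_\varepsilon$ out of (29.30) for simplicity, measuring
$x$ and $\xi$ in units of $\sigma_\delta$, $y$ and $\eta$ in units of $\sigma_\varepsilon$. (29.30) is then of the form
$$G(u, v) = \frac{2}{n}\log L = -\log(1 + u^2 + v^2) - \frac{s_\xi^2(1 + v^2) - 2s_{\xi\eta}uv + s_\eta^2(1 + u^2)}{1 + u^2 + v^2},$$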
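where $u^2 = \sigma_x^2$, $v^2 = \alpha_1^2 u^2$. $G(u, v)$ is differentiable and $\to -\infty$ when $(u^2 + v^2) \to \infty$,
so is maximized at a stationary value obtained by equating to zero the derivatives
$\partial G/\partial u$, $\partial G/\partial v$. We thence obtain
$$\begin{aligned}
\tfrac{1}{2}\Big\{(1 + u^2)\frac{\partial G}{\partial u} + uv\frac{\partial G}{\partial v}\Big\} &= \frac{s_\xi^2 u + s_{\xi\eta}v}{1 + u^2 + v^2} - u = 0,\\
\tfrac{1}{2}\Big\{uv\frac{\partial G}{\partial u} + (1 + v^2)\frac{\partial G}{\partial v}\Big\} &= \frac{s_{\xi\eta}u + s_\eta^2 v}{1 + u^2 + v^2} - v = 0.
\end{aligned} \tag{29.31}$$
Eliminating $(1 + u^2 + v^2)$ from these equations, we have
$$u^2\{\alpha_1^2 s_{\xi\eta} + \alpha_1(s_\xi^2 - s_\eta^2) - s_{\xi\eta}\} = 0.$$
Thus either we must have $u^2$ (and hence $v^2$ also) $= 0$ or the quadratic in braces must be
equated to zero. In the latter case, we are back at (29.27) (remembering that we have
made $\lambda = 1$ by our standardizations), and on destandardizing (29.29) is again the
solution for $\hat\alpha_1$. From (29.31),
$$s_\xi^2 + s_{\xi\eta}\alpha_1 - (1 + u^2 + v^2) = 0$$
which gives, since $v^2 = \alpha_1^2 u^2$,
$$\hat\sigma_x^2 = u^2 = (s_\xi^2 + \hat\alpha_1 s_{\xi\eta} - 1)/(1 + \hat\alpha_1^2). \tag{29.32}$$
The stationary values taken by $G$ at the points $(\pm u, \pm v)$ given by (29.29) and (29.32)
will only be maxima if the stationary value at $u = v = 0$ is not. From the second-order
derivatives of $G$, this is when one or more of the conditions $(s_\xi^2 - 1) > 0$, $(s_\eta^2 - 1) > 0$,
$s_{\xi\eta}^2 > (1 - s_\xi^2)(1 - s_\eta^2)$ is satisfied. Destandardizing, we therefore have that (29.29) and
(29.32) give the ML estimators ($\hat\alpha_1$, $\hat\sigma_x^2$) when one or more of the conditions $s_\xi^2 > \sigma_\delta^2$,
$s_\eta^2 > \sigma_\varepsilon^2$, $s_{\xi\eta}^2 > (\sigma_\delta^2 - s_\xi^2)(\sigma_\varepsilon^2 - s_\eta^2)$ holds. If none holds (which seems unlikely in practice)
$u = v = 0$ is a maximum, so $\hat\sigma_x^2 = 0$ and $\hat\alpha_1$ is indeterminate; cf. (29.20)(vi) and below
it.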


The identifiability problem of 29.11 disappears if we have replicated observations,
i.e., if there are $r_i$ observations $\xi_{ij}$ ($j = 1, 2, \ldots, r_i$) corresponding to the true value $x_i$,
and $s_i$ observations $\eta_{ik}$ ($k = 1, 2, \ldots, s_i$) corresponding to $y_i$, with at least one $r_i$ and one
$s_i$ exceeding unity. We write
$$\bar\xi_{i.} = \sum_{j=1}^{r_i}\xi_{ij}/r_i, \quad \bar\eta_{i.} = \sum_{k=1}^{s_i}\eta_{ik}/s_i, \quad R = \sum_{i=1}^{n} r_i, \quad S = \sum_{i=1}^{n} s_i, \quad \bar\xi_{..} = \sum_{i=1}^{n} r_i\bar\xi_{i.}/R, \quad \bar\eta_{..} = \sum_{i=1}^{n} s_i\bar\eta_{i.}/S.$$
Then it follows very simply that
$$\hat\sigma_\delta^2 = \frac{1}{R - n}\sum_{i}\sum_{j}(\xi_{ij} - \bar\xi_{i.})^2$$
and
$$\hat\sigma_\varepsilon^2 = \frac{1}{S - n}\sum_{i}\sum_{k}(\eta_{ik} - \bar\eta_{i.})^2$$
are unbiassed estimators of the error variances and that $\hat\lambda = \hat\sigma_\varepsilon^2/\hat\sigma_\delta^2$ is a consistent (though
biassed) estimator of their ratio. Any of the estimators of $\alpha_1$ in Cases 1-4 may now be
used with the appropriate estimator substituted, and it will be consistent. Madansky
(1959) discusses various estimators derived by methods similar to these, which are
essentially simple applications of the ideas of the Analysis of Variance (Volume 3). Dorff
and Gurland (1961a) make asymptotic variance comparisons when $s_i/r_i$ is constant which
favour the use of the estimator obtained by using $\hat\lambda$ in (29.29).


Generalization of the structural relationship model 


29.13 As we remarked below (29.18), the structural relationship model discussed
in 29.9-12 is a restrictive one because of the condition that all $x_i$ have the same mean,
which implies the same for the $y_i$. We had
$$E(\xi_i) = E(x_i) = \mu, \quad \text{all } i, \tag{29.33}$$
$$E(\eta_i) = E(y_i) = \alpha_0 + \alpha_1\mu, \quad \text{all } i. \tag{29.34}$$
Suppose now that we relax (29.33) and postulate that
$$E(\xi_i) = E(x_i) = \mu_i, \quad i = 1, 2, \ldots, n. \tag{29.35}$$
(29.34) is then replaced by
$$E(\eta_i) = E(y_i) = \alpha_0 + \alpha_1\mu_i. \tag{29.36}$$
This is a more comprehensive structural relationship model, which may be specialized
to the functional relationship model without loss of generality by putting $\sigma_x^2 = \sigma_y^2 = 0$,
so that $X_i = \mu_i$, $Y_i = \alpha_0 + \alpha_1\mu_i$.

However, in taking this more general model, we have radically changed the estimation
problem. For all the $\mu_i$ are unknown parameters, and thus instead of six parameters
to estimate, as in (29.18), we have $(n+5)$ parameters. The essentially new
feature is that every new observation brings with it a new parameter to be estimated,
and it is not surprising that we discover new problems in this case. These parameters,
specific to individual observations, were called "incidental" parameters by Neyman
and Scott (1948); other parameters, common to sets of observations, were called
"structural." We have already encountered a problem involving incidental parameters
in Example 18.16.

We have now to consider the ML estimation process in the presence of incidental 


parameters, and we shall proceed directly to the case of functional relationship, which 
is what interests us here. 


ML estimation of functional relationship 


29.14 Let us, then, suppose that (29.4), (29.5) and (29.6) hold, and that the $\delta_i$
and $\varepsilon_i$ are independent normal variables. Since the $X_i$ are mathematical, not random
variables, $\sigma_x^2 = 0$ and there are $(n+4)$ parameters, namely $\alpha_0$, $\alpha_1$, $\sigma_\delta^2$, $\sigma_\varepsilon^2$ and the $n$ values
$X_i$. Our Likelihood Function is
$$L \propto \sigma_\delta^{-n}\sigma_\varepsilon^{-n}\exp\Big[-\frac{1}{2\sigma_\delta^2}\sum_i(\xi_i - X_i)^2 - \frac{1}{2\sigma_\varepsilon^2}\sum_i\{\eta_i - (\alpha_0 + \alpha_1 X_i)\}^2\Big].$$


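Differentiating $\log L$ with respect to each $X_i$ as well as the other four parameters, we
find:
$$\frac{\partial\log L}{\partial X_i} = \frac{\xi_i - X_i}{\sigma_\delta^2} + \frac{\alpha_1}{\sigma_\varepsilon^2}\{\eta_i - (\alpha_0 + \alpha_1 X_i)\} = 0, \quad i = 1, 2, \ldots, n, \tag{29.37}$$
$$\frac{\partial\log L}{\partial\alpha_0} = \frac{1}{\sigma_\varepsilon^2}\sum_i\{\eta_i - (\alpha_0 + \alpha_1 X_i)\} = 0, \tag{29.38}$$
$$\frac{\partial\log L}{\partial\alpha_1} = \frac{1}{\sigma_\varepsilon^2}\sum_i X_i\{\eta_i - (\alpha_0 + \alpha_1 X_i)\} = 0, \tag{29.39}$$
$$\frac{\partial\log L}{\partial\sigma_\delta} = -\frac{n}{\sigma_\delta} + \frac{1}{\sigma_\delta^3}\sum_i(\xi_i - X_i)^2 = 0, \tag{29.40}$$
$$\frac{\partial\log L}{\partial\sigma_\varepsilon} = -\frac{n}{\sigma_\varepsilon} + \frac{1}{\sigma_\varepsilon^3}\sum_i\{\eta_i - (\alpha_0 + \alpha_1 X_i)\}^2 = 0. \tag{29.41}$$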

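Summing (29.37) over $i$, we find, using (29.38),
$$\sum_i(\xi_i - X_i) = 0.$$
Thus, if we measure the $\xi_i$ about their observed mean, we have the ML estimator of
the sum of the $X_i$,
$$\widehat{\Big(\sum X_i\Big)} = \sum\xi_i = 0. \tag{29.42}$$
Using (29.42), we have from (29.38)
$$\sum\eta_i = n\hat\alpha_0,$$
and if we measure the $\eta_i$ also about their observed mean this gives
$$\hat\alpha_0 = 0. \tag{29.43}$$
Using (29.43), we find from (29.39)
$$\hat\alpha_1 = \sum\hat X_i\eta_i\Big/\sum\hat X_i^2. \tag{29.44}$$
(29.40) gives
$$\hat\sigma_\delta^2 = \frac{1}{n}\sum(\xi_i - \hat X_i)^2, \tag{29.45}$$
while (29.43) in (29.41) gives
$$\hat\sigma_\varepsilon^2 = \frac{1}{n}\sum(\eta_i - \hat\alpha_1\hat X_i)^2. \tag{29.46}$$
But squaring in (29.37), we have, using (29.43),
$$\frac{(\xi_i - \hat X_i)^2}{\hat\sigma_\delta^4} = \frac{\hat\alpha_1^2}{\hat\sigma_\varepsilon^4}(\eta_i - \hat\alpha_1\hat X_i)^2, \tag{29.47}$$
and summing (29.47) over $i$, we find from the ratio of (29.45) and (29.46) that we must
have
$$\hat\sigma_\varepsilon^2 = \hat\alpha_1^2\hat\sigma_\delta^2. \tag{29.48}$$
Putting (29.48) back into (29.37) to eliminate $\hat\alpha_1$ (taken to be positive), we find
$$\frac{\xi_i - 2\hat X_i}{\hat\sigma_\delta^2} + \frac{\eta_i}{\hat\sigma_\delta\hat\sigma_\varepsilon} = 0,$$
so that the ML estimator of $X_i$ satisfies
$$\hat X_i = \tfrac{1}{2}\Big(\xi_i + \frac{\hat\sigma_\delta}{\hat\sigma_\varepsilon}\eta_i\Big). \tag{29.49}$$
To evaluate the ML estimators of $\sigma_\delta^2$ and $\sigma_\varepsilon^2$, we need to solve the $(n+2)$ equations
(29.45), (29.46) and (29.49) for the $(n+2)$ unknowns $\hat X_i$, $\hat\sigma_\delta^2$, $\hat\sigma_\varepsilon^2$. Thence, we evaluate
$\hat\alpha_1$ from (29.48).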

However, it is not worth proceeding with the ML estimation process, for (29.48),
first deduced by Lindley (1947), shows that the ML method fails us here. We have
no prior knowledge of the values of the parameters $\alpha_1$, $\sigma_\delta^2$, $\sigma_\varepsilon^2$, and yet (29.48) gives a
definite relation between the ML estimators, which is not true in the model as specified.
In fact, (29.48) clearly implies that we cannot be consistently estimating all three of
the parameters $\alpha_1$, $\sigma_\delta^2$, $\sigma_\varepsilon^2$. The ML solution is therefore unacceptable here.


29.15 It is, in fact, the general rule that, in the presence of incidental parameters, 
the ML estimators of structural parameters are not necessarily consistent, as Neyman 
and Scott (1948) showed. More recently, Kiefer and Wolfowitz (1956) have shown 
that if the incidental parameters are themselves independent, identically distributed random 
variables, and the structural parameters are identifiable, the ML estimators of structural
parameters are consistent, under regularity conditions. The italicized condition
evidently takes us back from our present functional relationship model to the structural
relationship model considered in 29.9-12, where we derived the ML estimators of $\alpha_1$
under various assumptions. Neyman (1951) had previously proved the existence of
consistent estimators of $\alpha_1$ in the structural relationship.


29.16 It is clear from 29.14 that we cannot obtain an acceptable ML estimator
of $\alpha_1$ in the functional relationship without a further assumption, and indeed this was
so even in the structural relationship case of 29.9-12, which our results and those quoted
in 29.15 show to be essentially simpler. This need for a further assumption often
seems strange to the user of statistical method, who has perhaps too much faith in its
power to produce a simple and acceptable solution to any problem which can be posed
simply. A geometrical illustration is therefore possibly useful.

Consider the points $(\xi_i, \eta_i)$ plotted as in Fig. 29.1.

Any observed point $(\xi_i, \eta_i)$ has emanated from a "true" point $(X_i, Y_i) = (\xi_i - \delta_i,
\eta_i - \varepsilon_i)$ whose situation is unknown. Since, in our model, $\delta_i$ and $\varepsilon_i$ are independent
normal variates, $(\xi_i, \eta_i)$ is equiprobable on any ellipse centred at $(X_i, Y_i)$, whose axes
are parallel to the co-ordinate axes. Conversely, since the frequency function of
$(\xi_i, \eta_i)$ is symmetric in $(\xi_i, \eta_i)$ and $(X_i, Y_i)$, there is an elliptical confidence region for
$(X_i, Y_i)$ at any given probability level, centred at $(\xi_i, \eta_i)$. These are the regions shown
in Fig. 29.1. Heuristically, our problem of estimating $\alpha_1$ may be conceived as that of
finding a straight line to intersect as many as possible of these confidence regions. The
difficulty is now plain to see: the problem as specified does not tell us what the lengths
of the axes of the ellipses should be; these depend on the scale parameters $\sigma_\delta$, $\sigma_\varepsilon$.
It is clear that to make the problem definite we need only know the eccentricity of the
ellipses, i.e. the ratio $\sigma_\varepsilon/\sigma_\delta$. It will be remembered that in the structural relationship
problem of 29.9-10, we found a knowledge of this ratio sufficient to solve the problem
of estimating $\alpha_1$.

[Fig. 29.1: Confidence regions for $(X_i, Y_i)$; see text. Horizontal axis: values of $\xi$; vertical axis: values of $\eta$.]


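29.17 Let us, then, suppose that $\sigma_\varepsilon^2/\sigma_\delta^2 = \lambda$ is known. If we substitute $\sigma_\varepsilon^2/\lambda$ for
$\sigma_\delta^2$ in our ML estimation process in 29.14, we find that the inconsistency produced
by (29.48) does not occur, since we now require to estimate only one error variance,
say $\sigma_\varepsilon^2$. Equations (29.40) and (29.41), which produced (29.48), are replaced by the
single equation
$$\frac{\partial\log L}{\partial\sigma_\varepsilon} = -\frac{2n}{\sigma_\varepsilon} + \frac{1}{\sigma_\varepsilon^3}\Big[\lambda\sum(\xi_i - X_i)^2 + \sum\{\eta_i - (\alpha_0 + \alpha_1 X_i)\}^2\Big] = 0,$$
which gives, since (29.43) (and (29.44)) remain valid,
$$\hat\sigma_\varepsilon^2 = \frac{1}{2n}\Big\{\lambda\sum(\xi_i - \hat X_i)^2 + \sum(\eta_i - \hat\alpha_1\hat X_i)^2\Big\}. \tag{29.50}$$
Instead of (29.49), we now have, direct from (29.37),
$$\lambda(\xi_i - \hat X_i) + \hat\alpha_1(\eta_i - \hat\alpha_1\hat X_i) = 0,$$
or
$$\hat X_i = \frac{\lambda\xi_i + \hat\alpha_1\eta_i}{\lambda + \hat\alpha_1^2}. \tag{29.51}$$
Putting (29.51) into (29.44), we have
$$\hat\alpha_1 = \frac{(\lambda + \hat\alpha_1^2)\{\lambda\sum\xi_i\eta_i + \hat\alpha_1\sum\eta_i^2\}}{\lambda^2\sum\xi_i^2 + \hat\alpha_1^2\sum\eta_i^2 + 2\lambda\hat\alpha_1\sum\xi_i\eta_i},$$
which simplifies to
$$\hat\alpha_1^2\sum\xi_i\eta_i + \hat\alpha_1\Big(\lambda\sum\xi_i^2 - \sum\eta_i^2\Big) - \lambda\sum\xi_i\eta_i = 0. \tag{29.52}$$
(29.52) is just (29.27) written in a slightly different notation. Thus the result of 29.12,
Case 3, holds good: (29.29) is the ML estimator of $\alpha_1$ in the linear functional relationship,
as well as in the simplest structural relationship, when the error variance ratio $\lambda$
is known.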


29.18 As we remarked at the end of 29.10 (in a discussion which applies here since
we estimate $\alpha_1$ as in Case 3 of 29.12), the estimated regression lines "bracket" the
estimated functional line. This also follows from the fact that $\hat\alpha_1$ defined at (29.29) is
a monotone function of $\lambda$ (the proof of this is left to the reader as Exercise 29.1). Thus
the estimated regression lines set mathematical limits to the estimated functional line.
However, these limits may be too far apart to be of much practical use. In any case,
they are not, of course, probabilistic limits of any kind.


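29.19 Knowledge of the ratio of error variances has enabled us in 29.17 to evaluate
ML estimators of $\alpha_1$ and $\sigma_\varepsilon^2$, namely (29.29) and (29.50). But our troubles are not
yet over, for although $\hat\alpha_1$ is a consistent estimator of $\alpha_1$, $\hat\sigma_\varepsilon^2$ is not a consistent estimator
of $\sigma_\varepsilon^2$, as Lindley (1947) showed.

To demonstrate the consistency of $\hat\alpha_1$, we observe from the general results of
Chapter 10 that the sample variances and covariance in (29.29) converge in probability
to their expectations. Thus, if we write the variance of the unobservable $X_i$ as $S_X^2$,
we have (cf. (29.18) for the structural relationship)
$$\begin{aligned}
s_\xi^2 &\to S_X^2 + \sigma_\delta^2 = S_X^2 + \sigma_\varepsilon^2/\lambda,\\
s_\eta^2 &\to \alpha_1^2 S_X^2 + \sigma_\varepsilon^2,\\
s_{\xi\eta} &\to \alpha_1 S_X^2.
\end{aligned} \tag{29.53}$$
Substituting (29.53) in (29.29), we see that
$$\hat\alpha_1 \to \frac{\{\alpha_1^2 S_X^2 + \lambda\sigma_\delta^2 - \lambda(S_X^2 + \sigma_\delta^2)\} + \big[\{\alpha_1^2 S_X^2 + \lambda\sigma_\delta^2 - \lambda(S_X^2 + \sigma_\delta^2)\}^2 + 4\lambda(\alpha_1 S_X^2)^2\big]^{1/2}}{2\alpha_1 S_X^2} = \alpha_1, \tag{29.54}$$
which establishes consistency. The same argument holds for the structural relationship
with $\sigma_x^2$ replacing $S_X^2$ throughout.

The inconsistency of $\hat\sigma_\varepsilon^2$ in the functional relationship is as simple to demonstrate.
Substituting (29.51) into (29.50), we have the alternative forms
$$\hat\sigma_\varepsilon^2 = \frac{\lambda}{2(\lambda + \hat\alpha_1^2)}\cdot\frac{1}{n}\sum(\eta_i - \hat\alpha_1\xi_i)^2, \tag{29.55}$$
$$\phantom{\hat\sigma_\varepsilon^2} = \frac{\lambda}{2(\lambda + \hat\alpha_1^2)}(s_\eta^2 - 2\hat\alpha_1 s_{\xi\eta} + \hat\alpha_1^2 s_\xi^2). \tag{29.56}$$
Using (29.53) and (29.54) in (29.55), we have
$$\hat\sigma_\varepsilon^2 \to \frac{\lambda}{2(\lambda + \alpha_1^2)}\Big\{\alpha_1^2 S_X^2 + \sigma_\varepsilon^2 - 2\alpha_1^2 S_X^2 + \alpha_1^2\Big(S_X^2 + \frac{\sigma_\varepsilon^2}{\lambda}\Big)\Big\} = \tfrac{1}{2}\sigma_\varepsilon^2. \tag{29.57}$$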
This substantial inconsistency in the ML estimator reminds one of the inconsistency
noticed in Example 18.16; the difficulty there was directly traceable to the use of
samples of size 2 together with the characteristic bias of order 1/n in ML estimators.
Here, too, we are essentially estimating $\sigma_\varepsilon^2$ from the pairs $(\xi_i, \eta_i)$, as the form (29.55)
for $\hat\sigma_\varepsilon^2$ makes clear. The inconsistency of the ML estimator is therefore a reflection
of the small-sample bias of ML estimators in general. This particular inconsistent
estimator causes no difficulty, a consistent estimator of $\sigma_\varepsilon^2$ being given by replacing the
number of observations, $2n$, by the number of degrees of freedom, $2n - (n+2) = n-2$,
in the divisor of $\hat\sigma_\varepsilon^2$. The consistent estimator is therefore
$$s_\varepsilon^2 = \frac{2n}{n-2}\hat\sigma_\varepsilon^2.$$

We have thus seen that in the functional relationship, even knowledge of $\lambda = \sigma_\varepsilon^2/\sigma_\delta^2$
is not enough for ML estimators to estimate all structural parameters consistently.
For some structural relationships, the consistency of the ML estimators of structural
parameters is guaranteed by the Kiefer-Wolfowitz result stated in 29.15 above.

Example 29.1 
R. L. Brown (1957) gives 9 pairs of observations 
ae De 5-8 i 9-3. 10% 334. -147 — 160 
Hi OY 125 200 SO 2 
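which were generated from a true linear functional relationship $Y = \alpha_0 + \alpha_1 X$ with
error variances $\sigma_\delta^2 = \sigma_\varepsilon^2$. Thus we have $\lambda = 1$, $n = 9$, and we compute
$$\sum\xi = 86.1, \qquad \sum\eta = 208.3,$$
$$\bar\xi = 9.57, \qquad \bar\eta = 23.14,$$
and, rounding to three figures,
$$ns_\xi^2 = 238, \qquad ns_\eta^2 = 906, \qquad ns_{\xi\eta} = 451,$$
so that (29.29) gives
$$\hat\alpha_1 = \frac{(906 - 238) + \{(906 - 238)^2 + 4(451)^2\}^{1/2}}{2\times 451} = \frac{668 + 1122}{902} = 1.99.$$

If we measure from the observed means, therefore, we have $\hat\alpha_0 = 0$ by (29.43) and
the estimated line is
$$Y - 23.14 = 1.99(X - 9.57)$$
or
$$Y = 1.99X + 4.10.$$
The consistent estimator of $\sigma_\varepsilon^2$ is, by 29.19, $s_\varepsilon^2 = \frac{2n}{n-2}\hat\sigma_\varepsilon^2$, where $\hat\sigma_\varepsilon^2$ is defined at
(29.56). We thus have as our estimator in this case
$$s_\varepsilon^2 = \frac{\lambda}{(n-2)(\lambda + \hat\alpha_1^2)}\,n\,(s_\eta^2 - 2\hat\alpha_1 s_{\xi\eta} + \hat\alpha_1^2 s_\xi^2)
= \frac{1}{7(1 + 1.99^2)}\{906 - (3.98\times 451) + (1.99)^2\,238\} = 1.53.$$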
In point of fact, the data were generated by adding to the linear functional relationship
Cf) = 2652) 

random normal errors $\delta$, $\varepsilon$ with common variance $\sigma_\varepsilon^2 = 1$. Thus the estimators,
particularly $\hat\alpha_1$, have performed rather well, even with $n$ as low as 9.
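
The arithmetic of Example 29.1 can be retraced from the rounded sums quoted in the text. This is a sketch only: the variable names are illustrative, and because (29.29) is unchanged when all three moments are multiplied by $n$, the sums $ns_\xi^2$, $ns_\eta^2$, $ns_{\xi\eta}$ are used directly. Carried to more decimals, the rounded sums give a slope of about 1.985 and a variance estimate of about 1.54, close to the figures 1.99 and 1.53 reported above.

```python
import math

n = 9
lam = 1.0                                   # known ratio of error variances
S_xx, S_yy, S_xy = 238.0, 906.0, 451.0      # n*s_xi^2, n*s_eta^2, n*s_xi_eta

# Slope from (29.29).
d = S_yy - lam * S_xx
a1 = (d + math.sqrt(d * d + 4.0 * lam * S_xy ** 2)) / (2.0 * S_xy)

# Sum of (eta_i - a1*xi_i)^2 with both variables measured from their means.
resid_ss = S_yy - 2.0 * a1 * S_xy + a1 ** 2 * S_xx

# ML estimator (29.56), divisor 2n, and the consistent version, divisor n - 2.
sigma2_ml = lam * resid_ss / (2.0 * n * (lam + a1 ** 2))
s2_consistent = 2.0 * n / (n - 2) * sigma2_ml

print(round(a1, 3), round(sigma2_ml, 3), round(s2_consistent, 3))
```

The ML value is well under half of the consistent one here, which is the behaviour described at (29.57).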


Confidence interval estimation and tests 
29.20 So far, we have only discussed the point estimation of the parameters.
We now consider the question of interval estimation, and turn first to the problem of
finding confidence intervals (and the corresponding tests of hypotheses) for $\alpha_1$ alone,
which has been solved by Creasy (1956) in the case where the ratio of error variances $\lambda$
is known. We can always reduce this to the case $\lambda = 1$ by dividing the observed values
of $\eta$ by $\lambda^{1/2}$. Hence we may without loss of generality consider only the case where the
error variances are known to be equal. In this case, the Likelihood Function is
$$L \propto \sigma_\varepsilon^{-2n}\exp\Big\{-\frac{1}{2\sigma_\varepsilon^2}\Big(\sum_{i=1}^{n}(\xi_i - X_i)^2 + \sum_{i=1}^{n}(\eta_i - Y_i)^2\Big)\Big\} \tag{29.58}$$
whether the relationship is structural or functional. Maximizing (29.58) is the same
as minimizing the term in parentheses, which may be rewritten as $\sum_{i=1}^{n}(\delta_i^2 + \varepsilon_i^2)$. We
therefore see, by Pythagoras' theorem, that the ML estimation procedure minimizes the
sum of squares of perpendicular distances from the observed points $(\xi_i, \eta_i)$ to the
estimated line. This is intuitively obvious from the equality of the error variances.
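
The perpendicular-distance property can be checked numerically. The sketch below is illustrative only (the helper name, the grid and its spacing are assumptions): using the sums of Example 29.1 with $\lambda = 1$, it compares the slope given by (29.29) with the slope that minimizes, over a fine grid, the sum of squared perpendicular distances to a line through the observed means.

```python
import math


def perpendicular_ss(a, S_xx, S_yy, S_xy):
    """Sum of squared perpendicular distances from the points to the line
    through the means with slope a: sum (eta - a*xi)^2 / (1 + a^2)."""
    return (S_yy - 2.0 * a * S_xy + a * a * S_xx) / (1.0 + a * a)


S_xx, S_yy, S_xy = 238.0, 906.0, 451.0      # sums quoted in Example 29.1

# Slope from (29.29) with lambda = 1.
d = S_yy - S_xx
a_hat = (d + math.sqrt(d * d + 4.0 * S_xy ** 2)) / (2.0 * S_xy)

# Brute-force minimisation over slopes 0.001, 0.002, ..., 5.000.
grid = [k / 1000.0 for k in range(1, 5001)]
a_grid = min(grid, key=lambda a: perpendicular_ss(a, S_xx, S_yy, S_xy))

print(round(a_hat, 4), a_grid)
```

The grid minimum falls at the grid point nearest to the value of (29.29), as the argument from (29.58) requires.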
We now define
$$\alpha_1 = \tan\theta, \qquad \hat\alpha_1 = \tan\hat\theta, \tag{29.59}$$
and we have at once from (29.29) and the invariance of ML estimators under transformation
that the ML estimator of $\tan 2\theta$ is
$$\tan 2\hat\theta = \frac{2\tan\hat\theta}{1 - \tan^2\hat\theta} = \frac{2\hat\alpha_1}{1 - \hat\alpha_1^2} = \frac{2s_{\xi\eta}}{|s_\xi^2 - s_\eta^2|}, \tag{29.60}$$
the modulus in the denominator on the right of (29.60) ensuring that the sign of
$\tan 2\hat\theta$ is that of $\hat\alpha_1$ and $s_{\xi\eta}$.
If and only if $\alpha_1 = 0$, $\xi$ and $\eta$ are uncorrelated, by (29.18), and since they are normal,
this implies that their observed correlation coefficient
$$r = s_{\xi\eta}/(s_\xi s_\eta)$$
will be distributed in the form (16.62), or equivalently, by (16.63), that
$$t = \{(n-2)r^2/(1-r^2)\}^{1/2} \tag{29.61}$$
is distributed in "Student's" form with $(n-2)$ degrees of freedom. Since
$$\sin^2 2\hat\theta = \tan^2 2\hat\theta/(1 + \tan^2 2\hat\theta),$$
(29.61) may be rewritten, using (29.60), as
$$t = \Big\{(n-2)\sin^2 2\hat\theta\;\frac{(s_\xi^2 - s_\eta^2)^2 + 4s_{\xi\eta}^2}{4(s_\xi^2 s_\eta^2 - s_{\xi\eta}^2)}\Big\}^{1/2}. \tag{29.62}$$
The statistic (29.61) or (29.62) may be used to test the hypothesis that $\alpha_1 = \theta = 0$.
29.21 If we wish to test the hypothesis that $\alpha_1$ takes some non-zero value, a difficulty
arises, for the correlation between $\xi$ and $\eta$ is seen from (29.18) to be, with $\sigma_\delta^2 = \sigma_\varepsilon^2$,
$$\rho = \frac{\alpha_1\sigma_x^2}{\{(\alpha_1^2\sigma_x^2 + \sigma_\varepsilon^2)(\sigma_x^2 + \sigma_\varepsilon^2)\}^{1/2}},$$
a function of the unknown $\sigma_x^2$ (for which we understand $S_X^2$ in the functional case, as
previously). To remove this difficulty we observe that for any known value of $\alpha_1$, and
therefore of $\theta$, we can transform the observed values $(\xi_i, \eta_i)$ to new values $(\xi_i', \eta_i')$ by
the orthogonal transformation
$$\xi' = \eta\sin\theta + \xi\cos\theta,$$
$$\eta' = \eta\cos\theta - \xi\sin\theta,$$
which simply rotates the co-ordinate axes through the angle $\theta$. Thus to test that $\alpha_1$
takes any specified value, we simply test that $\alpha_1 = \theta = 0$ for the transformed variables
$(\xi', \eta')$. Since variances and covariances are invariant under orthogonal transformation,
this means that (29.62) remains our test statistic, except that in it $\hat\theta$ is replaced
by $(\hat\theta - \theta)$.

There remains the difficulty that to each value of $t$ in (29.62) thus modified, there
correspond four values of $\theta$, as a result of the periodicity of the sine function. If we
may take it that the probability that $|\hat\theta - \theta|$ exceeds $\tfrac{1}{4}\pi$ is negligible, the problem
disappears, and we may use (29.62), with $(\hat\theta - \theta)$ written for $\hat\theta$, to test any value of $\alpha_1 = \tan\theta$
or to set confidence limits for $\theta$ and thence $\alpha_1$. The confidence limits for $\theta$ are, of
course, simply
$$\hat\theta \pm \tfrac{1}{2}\arcsin\Big[2t\Big\{\frac{s_\xi^2 s_\eta^2 - s_{\xi\eta}^2}{(n-2)[(s_\xi^2 - s_\eta^2)^2 + 4s_{\xi\eta}^2]}\Big\}^{1/2}\Big], \tag{29.63}$$
where $t$ is the appropriate "Student's" deviate for $(n-2)$ degrees of freedom and the
confidence coefficient being used. Because of the condition that $|\hat\theta - \theta| < \tfrac{1}{4}\pi$, this is
essentially a large-sample method.


Example 29.2 

For the data of Example 29.1, we find
$$\hat\theta = \arctan 1.99 = 0.35\pi$$
and for 7 degrees of freedom and a central confidence interval with coefficient 0.95,
we have from a table of "Student's" distribution
$$t = 2.36.$$
Thus (29.63) becomes, using the computations for $\hat\alpha_1$,
$$0.35\pi \pm \tfrac{1}{2}\arcsin\Big[4.72\,\frac{(238\times 906 - 451^2)^{1/2}}{\sqrt{7}\times 1122}\Big] = 0.35\pi \pm \tfrac{1}{2}\arcsin 0.1742 = 0.35\pi \pm 0.03\pi.$$
The 95 per cent confidence limits for $\theta$ are therefore $0.32\pi$ and $0.38\pi$. Those for $\alpha_1$
are simply the tangents of these angles, namely 1.58 and 2.53. The ML estimate 1.99
is not central between these limits precisely because we had to transform to $\theta$ to obtain
them. The limits are rather wide apart as a result of the few degrees of freedom
available.
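
The limits (29.63) are easily evaluated in code. The sketch below is illustrative: the function name is an assumption, and the value $t = 2.36$ is supplied by hand from a table, as in the example. Because the ratio inside (29.63) is unchanged when every moment is multiplied by $n$, the sums $ns_\xi^2$, $ns_\eta^2$, $ns_{\xi\eta}$ of Example 29.1 can be used directly. Working without intermediate rounding of the angles gives slope limits of roughly 1.61 and 2.52, slightly different from the 1.58 and 2.53 obtained above from angles rounded to two decimals of $\pi$.

```python
import math


def theta_limits(S_xx, S_yy, S_xy, n, t):
    """Confidence limits (29.63) for theta = arctan(alpha_1), equal error variances."""
    d = S_yy - S_xx
    a_hat = (d + math.sqrt(d * d + 4.0 * S_xy ** 2)) / (2.0 * S_xy)   # (29.29), lambda = 1
    theta_hat = math.atan(a_hat)
    ratio = (S_xx * S_yy - S_xy ** 2) / ((n - 2) * ((S_xx - S_yy) ** 2 + 4.0 * S_xy ** 2))
    half_width = 0.5 * math.asin(2.0 * t * math.sqrt(ratio))
    return theta_hat - half_width, theta_hat + half_width


theta_lo, theta_hi = theta_limits(238.0, 906.0, 451.0, n=9, t=2.36)
slope_lo, slope_hi = math.tan(theta_lo), math.tan(theta_hi)

print(round(theta_lo / math.pi, 3), round(theta_hi / math.pi, 3))
print(round(slope_lo, 2), round(slope_hi, 2))
```

As in the text, the interval for the slope is not symmetric about the point estimate, because it is symmetric in $\theta$ rather than in $\tan\theta$.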
29.22 As well as setting confidence limits for $\alpha_1$ in the manner of the preceding
section, we may, as R. L. Brown (1957) pointed out, find a confidence region for the
whole line if the error variances are both known. For notwithstanding the fact proved at
(29.9) that the error term $(\varepsilon - \alpha_1\delta)$ is correlated with $\xi$, we may rewrite (29.8) as
$$\eta - (\alpha_0 + \alpha_1\xi) = \varepsilon - \alpha_1\delta. \tag{29.64}$$
The right-hand side of (29.64) is a normally distributed random variable with zero
mean and variance $\sigma_\varepsilon^2 + \alpha_1^2\sigma_\delta^2$, and the left-hand side of (29.64) contains only the
observables $\xi$, $\eta$ and the parameters $\alpha_0$, $\alpha_1$. We thus have the fact that
$$\sum_{i=1}^{n}\frac{\{\eta_i - (\alpha_0 + \alpha_1\xi_i)\}^2}{\sigma_\varepsilon^2 + \alpha_1^2\sigma_\delta^2} \tag{29.65}$$
is distributed in the chi-squared form with $n$ degrees of freedom. If $\sigma_\delta^2$ and $\sigma_\varepsilon^2$
are both known, we may without loss of generality take them both to be equal to unity,
since we need only divide $\xi_i$ by $\sigma_\delta$ and $\eta_i$ by $\sigma_\varepsilon$ to achieve this. (29.65) then becomes
the $\chi^2$ variate
$$\sum_{i=1}^{n}\frac{\{\eta_i - (\alpha_0 + \alpha_1\xi_i)\}^2}{1 + \alpha_1^2}. \tag{29.66}$$
We may use (29.66) to find a confidence region for the line. For if we determine
$c_\gamma$ from tables of $\chi^2$ by

Plin > GY} =1—y, 
we have probability y that 


X {yi —(%o+08;) }? < c,(1 +079). (29.67) 
i=1 
Measuring from the observed means é, 7 as before, (29.67) becomes 
So + a6 + af s?— 204 Sen < c,(1 +02) 
or 
at (s?—C,)— 204 Sz, toe < c,—s%. (29.68) 
If we take the equality sign in (29.68), it defines a conic in the (%9, «,) plane. If c, is 
increased (i.e. y is increased), the new conic lies inside the previous one. The conic 
is a 100(1—+) per cent confidence region for (%, a). 

This confidence region is bounded if the conic is an ellipse, but unbounded if the 
conic is a hyperbola. ‘There may, in fact, be no real values of (9, «1) satisfying (29.68). 
We have already discussed this difficulty in another context in Example 20.5. 
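Whether a particular trial line belongs to the region can be checked directly from (29.67). The sketch below assumes that both error variances have already been scaled to unity; the function name `in_brown_region` is hypothetical, the data are the eight observations listed in Example 29.4 (used here purely for illustration), and 15.507 is the tabulated upper 5 per cent point of $\chi^2$ with 8 degrees of freedom.

```python
def in_brown_region(xi, eta, alpha0, alpha1, c_gamma):
    """True if the line eta = alpha0 + alpha1*xi satisfies inequality (29.67).

    Assumes the observations were scaled so that both error variances are 1.
    """
    q = sum((e - (alpha0 + alpha1 * x)) ** 2 for x, e in zip(xi, eta))
    return q < c_gamma * (1.0 + alpha1 ** 2)

xi = [1.8, 4.1, 5.8, 7.5, 10.6, 13.4, 14.7, 18.9]
eta = [6.9, 12.5, 20.0, 15.7, 23.4, 30.2, 35.6, 39.1]

near = in_brown_region(xi, eta, alpha0=3.725, alpha1=2.0, c_gamma=15.507)
far = in_brown_region(xi, eta, alpha0=0.0, alpha1=0.0, c_gamma=15.507)
print(near, far)
```

Scanning a grid of $(\alpha_0, \alpha_1)$ with such a check traces out the conic without any algebra, which is a convenient way of seeing whether it is closed.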

We may now, just as in 28.26, treat (29.68) as a constraint which the true line must
satisfy, and then find the envelope of the family of possible lines, which will again
be a conic, by differentiation. The result, given by R. L. Brown (1957), is the region
in the $(X, Y)$ plane bounded by

$$\frac{(Y-\hat\alpha_1 X)^2}{c_\gamma-b_1}-\frac{(\hat\alpha_1 Y+X)^2}{b_2-c_\gamma} = 1+\hat\alpha_1^2, \tag{29.69}$$

where $\hat\alpha_1$ is defined by (29.29) and

$$b_1 = s_\xi^2-\frac{s_{\xi\eta}}{\hat\alpha_1}, \qquad b_2 = s_\xi^2+\hat\alpha_1 s_{\xi\eta}.$$

$\hat\alpha_1$ is assumed positive, so that $b_2 > b_1$. (29.69) is the required confidence region,
which is a hyperbola if $b_1 < c_\gamma < b_2$, an ellipse if $c_\gamma > b_2$. If $c_\gamma < b_1$, the conic is
not real.


29.23 The result at (29.69) is expressed in terms of the estimator (29.29), which
may not be the ML estimator when both error variances are known, as was assumed in
29.22; see the discussion of Case 4 in 29.12.

Quite apart from this point and the conceptual difficulties arising from confidence
regions for the line in the cases stated at the end of 29.22, another remark is worth
making. It is not an efficient procedure to set confidence limits for the mean of a
normal distribution with known variance by using the $\chi^2$ distribution of the sum of
squares about that mean: we remarked this in Example 20.5 and again in Example 25.3,
where we showed that the ARE of the corresponding test procedure is zero. We thus
should not expect the confidence region (29.69), which is based essentially on this
inefficient procedure, to be efficient. It is given here only because better results are not
available.


Linear functional relationship between several variables 
29.24 We now consider the estimation of a linear relationship in $k$ variables. To
make the notation more symmetrical, we will consider the variables $X_1, X_2, \ldots, X_k$
and the dummy variable $X_0\ (\equiv 1)$ related by the equation

$$\sum_{j=0}^{k}\alpha_j X_j = 0. \tag{29.70}$$

Apart from $X_0$, the variables are subject to error, so that we observe $\xi_j$ given by

$$\xi_{ji} = X_{ji}+\delta_{ji}, \qquad j = 1, 2, \ldots, k;\ i = 1, 2, \ldots, n. \tag{29.71}$$

Of course, $\xi_{0i} = X_{0i} = 1$ for all $i$. As before, we assume the $\delta$'s normally distributed,
independently of $X$ and of each other, with zero means; and we make the situation
identifiable by postulating knowledge of the ratios of error variances. If we suppose
that

$$\frac{\operatorname{var}\delta_1}{\lambda_1} = \frac{\operatorname{var}\delta_2}{\lambda_2} = \cdots = \frac{\operatorname{var}\delta_k}{\lambda_k}, \tag{29.72}$$

where the $\lambda$'s are known, we may remove them from the analysis at the outset by
dividing the observed $\xi_j$ by $\sqrt{\lambda_j}$. These standardized variables will all have the same
unknown variance, say $\sigma_\delta^2$.

The logarithm of the Likelihood Function of the error variables is then

$$\log L = \text{constant}-nk\log\sigma_\delta-\frac{1}{2\sigma_\delta^2}\sum_{j=1}^{k}\sum_{i=1}^{n}(\xi_{ji}-X_{ji})^2. \tag{29.73}$$

If we regard our data as $n$ points in $k$ dimensions, the problem is to determine the
hyperplane (29.70). Maximizing the likelihood is equivalent to minimizing the double
sum in (29.73) and this is the sum of squares of distances from the observed points $\xi$
to the estimated points $X$, as in (29.58). This sum is a minimum when the estimated
$X$'s are the feet of perpendiculars from the $\xi$'s on to the hyperplane. Thus the ML
estimator of the hyperplane is determined so that the sum of squares of perpendiculars
on to it from the observed set of points is a minimum.


29.25 This is a problem of familiar type in many mathematical contexts. The
distance of a point $\xi_{1i}, \xi_{2i}, \ldots, \xi_{ki}$ from the hyperplane (29.70) is

$$\sum_{j=0}^{k}\alpha_j\xi_{ji}\Big/\Big(\sum_{j=0}^{k}\alpha_j^2\Big)^{1/2}.$$

The quantity to be minimized is then

$$S = \sum_{i=1}^{n}\Big(\sum_{j=0}^{k}\alpha_j\xi_{ji}\Big)^2\Big/\sum_{j=0}^{k}\alpha_j^2.$$

It is convenient to regard this as a minimization of

$$S' = \sum_i\Big(\sum_j\alpha_j\xi_{ji}\Big)^2 \tag{29.74}$$

subject to the constraint

$$\sum_j\alpha_j^2 = \text{constant},$$

and we may take the constant to be 1, without loss of generality, by (29.70). We
have then to minimize unconditionally

$$\sum_i\Big(\sum_j\alpha_j\xi_{ji}\Big)^2-\mu\sum_j\alpha_j^2,$$

where $\mu$ is a Lagrange undetermined multiplier. Differentiating with respect to $\alpha_l$,
we have

$$2\sum_i\xi_{li}\Big(\sum_j\alpha_j\xi_{ji}\Big)-2\mu\alpha_l = 0, \qquad l = 0, 1, 2, \ldots, k. \tag{29.75}$$

The first of these equations, with $l = 0$ $(\xi_{0i} = 1)$, may be removed if we take an origin
at the mean of the $\xi$'s, i.e.

$$\sum_i\xi_{ji} = 0.$$

Writing $c_{ij}$ for the covariance of the $i$th and $j$th variate, we then have, from (29.75),

$$\sum_{j=1}^{k}\alpha_j c_{lj} = \frac{\mu}{n}\,\alpha_l, \qquad l = 1, 2, \ldots, k. \tag{29.76}$$

Taking the right-hand terms over to the left and eliminating the $\alpha$'s between these
equations, we find

$$\begin{vmatrix} c_{11}-\mu/n & c_{12} & c_{13} & \cdots & c_{1k}\\ c_{12} & c_{22}-\mu/n & c_{23} & \cdots & c_{2k}\\ c_{13} & c_{23} & c_{33}-\mu/n & \cdots & c_{3k}\\ \vdots & \vdots & \vdots & & \vdots\\ c_{1k} & c_{2k} & c_{3k} & \cdots & c_{kk}-\mu/n \end{vmatrix} = 0. \tag{29.77}$$

If we divide row $i$ by the observed standard deviation of $\xi_i$, $S_i$, and column $j$ by $S_j$,
and write $r_{ij}$ for correlations, (29.77) becomes

$$\begin{vmatrix} 1-\theta_1 & r_{12} & r_{13} & \cdots & r_{1k}\\ r_{12} & 1-\theta_2 & r_{23} & \cdots & r_{2k}\\ r_{13} & r_{23} & 1-\theta_3 & \cdots & r_{3k}\\ \vdots & \vdots & \vdots & & \vdots\\ r_{1k} & r_{2k} & r_{3k} & \cdots & 1-\theta_k \end{vmatrix} = 0, \tag{29.78}$$

where $\theta_i = \mu/(nS_i^2)$.

We can solve (29.78) for $\mu$ and hence find the $\alpha$'s from (29.76). In actual computational
practice it is customary to follow an iterative process which evaluates the $\alpha$'s
simultaneously. These solutions for the $\alpha$'s are, of course, the ML estimators of the
true values.

Note that (29.78) is an equation of degree $k$ in $\mu$, with $k$ roots (which, incidentally,
are always real since the matrix, of which the left-hand side is the determinant, is
non-negative definite). We require the smallest of these roots, for if we multiply the $l$th
equation in (29.75) by $\alpha_l$ and add the last $k$ of them, we find that the left-hand side
sums to (29.74). Hence

$$S' = \mu\sum_l\alpha_l^2 = \mu \tag{29.79}$$

is to be minimized.
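Equations (29.76) say that the coefficient vector is a latent vector of the covariance matrix and that $\mu/n$ is the corresponding latent root, so the fit can be sketched with any symmetric eigenvalue routine. The Python below is a minimal illustration, assuming the variables have already been standardized to equal error variances; the helper names `jacobi_eigen` and `fit_hyperplane` are hypothetical, and the cyclic Jacobi rotation method is merely one convenient way of obtaining the roots, not the iterative process referred to above.

```python
import math

def jacobi_eigen(a, sweeps=100, tol=1e-24):
    """Latent roots and vectors (columns of v) of a symmetric matrix, by Jacobi rotations."""
    k = len(a)
    a = [row[:] for row in a]
    v = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for _ in range(sweeps):
        off = sum(a[p][q] ** 2 for p in range(k) for q in range(p + 1, k))
        if off < tol:
            break
        for p in range(k):
            for q in range(p + 1, k):
                if a[p][q] == 0.0:
                    continue
                if a[q][q] == a[p][p]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2.0 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for mat in (a, v):      # rotate columns p and q of a and of v
                    for r in range(k):
                        mp, mq = mat[r][p], mat[r][q]
                        mat[r][p], mat[r][q] = c * mp - s * mq, s * mp + c * mq
                for r in range(k):      # rotate rows p and q of a
                    ap, aq = a[p][r], a[q][r]
                    a[p][r], a[q][r] = c * ap - s * aq, s * ap + c * aq
    return [a[i][i] for i in range(k)], v

def fit_hyperplane(points):
    """Return (alpha, means, mu_over_n): sum_j alpha[j]*(x[j]-means[j]) = 0 is the fitted plane."""
    n, k = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(k)]
    cov = [[sum((p[i] - means[i]) * (p[j] - means[j]) for p in points) / n
            for j in range(k)] for i in range(k)]
    roots, vecs = jacobi_eigen(cov)
    m = min(range(k), key=lambda j: roots[j])       # smallest root, as required by (29.79)
    return [vecs[i][m] for i in range(k)], means, roots[m]

# Hypothetical error-free points lying exactly on x3 = 2*x1 - x2 + 1.
points = [(0, 0, 1), (1, 0, 3), (0, 1, 0), (1, 1, 2), (2, 1, 4), (3, 5, 2)]
alpha, means, mu_over_n = fit_hyperplane(points)
print(alpha, mu_over_n)
```

For these error-free points the smallest root is zero to rounding error and the fitted coefficients are proportional to $(2, -1, -1)$; with real data $n$ times the smallest root is the minimized sum of squared perpendiculars (29.79).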


29.26 The method of 29.22 for two variables can be extended to give a quadric
surface as confidence region in $k$ dimensions for $\alpha_1, \alpha_2, \ldots, \alpha_k$. Likewise a quadric
can be found "within" which should lie the hyperplane (29.70) representing the
functional relation. Such regions are, of course, difficult to visualize and impossible
to draw for $k > 3$. Reference may be made to R. L. Brown and Fereday (1958) for
details. (Some of their results are given in Exercises 29.6-8.) The remarks of 29.23
will apply here also.


Villegas (1961) considers the case where the error variances are unknown and they 
are estimated from replicated observations (cf. 29.12) by ML, and (1964) bases a con- 
fidence region for the linear relation on these results. His discussion covers the case 
of correlated errors. 

Sprent (1966) gives a general method of estimating the coefficients when the errors 
are correlated. 


29.27. So far, we have essentially been considering situations in which identifi- 
ability is assured by some knowledge or assumption concerning the error variances, 
or by replicated observations. The question now arises whether there is any other 
way of making progress in the problem of estimating a linear functional or structural 
relationship. Different approaches have been made to this question, which we now
consider in turn.

Geary's method of using product-cumulants


29.28 The first method we consider was proposed by Geary (1942b, 1943) in the
structural relationship context, but applies also to the functional relationship situation.
We write the linear structural relationship in the homogeneous form

$$\alpha_1 x_1+\alpha_2 x_2+\cdots+\alpha_k x_k = 0. \tag{29.80}$$

Each of the $x_j$ is subject to an error of observation, $\delta_j$, which is a random variable
independent of $x_j$, and the observable is $\xi_j = x_j+\delta_j$. The $\delta_j$ are mutually independent.
Consider the joint cumulant-generating function of the $\xi_j$. It will be the sum of the
joint c.g.f. of the $x_j$ and that of the $\delta_j$. The product-cumulants of the latter are all
zero, by Example 12.7. Thus the product-cumulants of the other two sets, the $\xi_j$
and the $x_j$, must be identical. If we write $\kappa_x$ for cumulants of the $x$'s, $\kappa_\xi$ for cumulants
of the $\xi$'s and write the multiple subscripts as arguments in parentheses we have

$$\kappa_x(p_1, p_2, \ldots, p_k) = \kappa_\xi(p_1, p_2, \ldots, p_k), \tag{29.81}$$

provided that at least two $p_j$ exceed zero. Thus the product-cumulants of the $x$'s
can be estimated by estimating those of the $\xi$'s.


29.29 The joint c.f. of the $x$'s, measured from their true means, is

$$\phi(t_1, t_2, \ldots, t_k) = E\Big\{\exp\Big(\sum_{j=1}^{k}\theta_j x_j\Big)\Big\}, \tag{29.82}$$

where $\theta_j = it_j$. Differentiation of (29.82) with respect to each $\theta_j$ yields

$$\sum_j\alpha_j\frac{\partial\phi}{\partial\theta_j} = E\Big\{\Big(\sum_j\alpha_j x_j\Big)\exp\Big(\sum_j\theta_j x_j\Big)\Big\} = 0, \tag{29.83}$$

using (29.80). For the c.g.f. $\psi = \log\phi$ also, we have from (29.83)

$$\sum_j\alpha_j\frac{\partial\psi}{\partial\theta_j} = \frac{1}{\phi}\sum_j\alpha_j\frac{\partial\phi}{\partial\theta_j} = 0. \tag{29.84}$$

Since, by definition,

$$\psi = \sum\kappa(p_1, p_2, \ldots, p_k)\,\frac{\theta_1^{p_1}\theta_2^{p_2}\cdots\theta_k^{p_k}}{p_1!\,p_2!\cdots p_k!},$$

we have from (29.84) for all $p_j \geq 0$

$$\alpha_1\kappa(p_1+1, p_2, \ldots, p_k)+\alpha_2\kappa(p_1, p_2+1, \ldots, p_k)+\cdots+\alpha_k\kappa(p_1, p_2, \ldots, p_k+1) = 0. \tag{29.85}$$

The relations (29.85) will also be true for the product-cumulants of the observed $\xi$'s,
in virtue of (29.81), provided that at least two of the arguments in each cumulant
exceed zero, i.e. if two or more $p_j > 0$. In the functional relationship situation, the
same argument holds. The random variable $x_j$, on which $n$ observations are made,
is now replaced by a set of $n$ fixed values $X_{j1}, \ldots, X_{jn}$. If this is regarded as a finite
population which is exhaustively sampled, our argument remains intact.

29.30 Unfortunately, the method of estimating the $\alpha_j$ from (29.85) (with estimators
substituted for the product-cumulants) is completely useless if the $x$'s are jointly
normally distributed, the most important case in practice. For the total order of each
product-cumulant in (29.85) is $\sum_{j=1}^{k}p_j+1 \geq 3$ since two or more $p_j > 0$. All cumulants
of order $\geq 3$ are zero in normal systems, as we have seen in 15.3. Thus the equations
(29.85) are nugatory in this case. This is not at all surprising, for we are dealing here
with the unidentifiable situation of 29.9, and we have made no further assumption to
render the situation identifiable.

Even in non-normal cases, there remains the problem to decide which of the relations
(29.85) should be used to estimate the $k$ coefficients $\alpha_j$. We need only $k$ equations, but
(assuming that all cumulants exist) have a choice of an infinite number. The obvious
course is to use the lowest-order equations, taking the $p_j$ as small as possible, for then
the estimation of the product-cumulants in (29.85) will be less subject to sampling
fluctuations (cf. 10.8(e)). However, we must be careful, even in the simplest case,
which we now discuss.


29.31 Consider the simplest case, with $k = 2$, which we specified by (29.13).
We rewrite this in the form $\alpha_1 x-y = 0$, which is (29.80) with $x = x_1$, $y = x_2$,
$\alpha_2 = -1$, and $\alpha_0 = 0$ because we are measuring from the means of $x$ and $y$. (29.85)
gives in this case the relations

$$\alpha_1\kappa(p_1+1, p_2)-\kappa(p_1, p_2+1) = 0$$

or, if $\kappa(p_1+1, p_2) \neq 0$,

$$\alpha_1 = \frac{\kappa(p_1, p_2+1)}{\kappa(p_1+1, p_2)}. \tag{29.86}$$

This holds for any $p_1, p_2 > 0$, and is therefore, as remarked in 29.30, useless in the
normal case. Even if the distribution of the observables $(\xi, \eta)$ is not normal, its
marginal distributions may be symmetrical, and if so all odd-order product-moments
and hence product-cumulants will be zero. Thus even in the absence of normality,
we must ensure that $(p_1+p_2+1)$ is even in order to guard against the danger of
symmetry. The lowest-order generally useful relations are therefore

$$p_1 = 1,\ p_2 = 2:\quad \alpha_1 = \kappa_{13}/\kappa_{22}; \qquad p_1 = 2,\ p_2 = 1:\quad \alpha_1 = \kappa_{22}/\kappa_{31}, \tag{29.87}$$

the cumulants being those of $(\xi, \eta)$, which are to be estimated from the observations.

There remains the question of deciding which of the relations (29.87) to use, or
more generally, which combination of them to use. Madansky (1959) suggests finding
a minimum variance linear combination, but the algebra would be formidable and not
necessarily conclusive in the absence of some assumptions on the parent $(\xi, \eta)$
distribution.

Even in the absence of symmetry, we may still be unfortunate enough to be sampling
a distribution for which the denominator product-cumulant used in (29.87) is equal
to zero or nearly so; then we may expect large sampling fluctuations in the estimator.
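The ratios in (29.86) and (29.87) are easily computed from the sample central product-moments. The sketch below is illustrative only: the function names are hypothetical, sample moments (divisor $n$) are used in place of the population cumulants, and the data are an artificial error-free skew sample with $\eta = 2\xi+1$, chosen so that every ratio must return the slope 2 exactly.

```python
def central_product_moment(xi, eta, r, s):
    """Sample central product-moment of order (r, s), with divisor n."""
    n = len(xi)
    mx, my = sum(xi) / n, sum(eta) / n
    return sum((x - mx) ** r * (y - my) ** s for x, y in zip(xi, eta)) / n

def cumulant_ratio_estimates(xi, eta):
    """Estimates of alpha_1 from (29.86) with p1 = p2 = 1 and from the two relations (29.87)."""
    m = lambda r, s: central_product_moment(xi, eta, r, s)
    k21, k12 = m(2, 1), m(1, 2)                      # third-order cumulants equal the moments
    k31 = m(3, 1) - 3.0 * m(2, 0) * m(1, 1)
    k13 = m(1, 3) - 3.0 * m(0, 2) * m(1, 1)
    k22 = m(2, 2) - m(2, 0) * m(0, 2) - 2.0 * m(1, 1) ** 2
    return {"k12/k21": k12 / k21, "k13/k22": k13 / k22, "k22/k31": k22 / k31}

# Artificial, error-free, skew data (hypothetical): eta = 2*xi + 1.
xi = [0.0, 1.0, 2.0, 3.0, 10.0]
eta = [2.0 * x + 1.0 for x in xi]
estimates = cumulant_ratio_estimates(xi, eta)
print(estimates)
```

With observational errors the three ratios no longer agree, and each becomes unstable whenever its denominator cumulant is near zero, which is the difficulty described above.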


Example 29.3

Let us reconsider the data of Example 29.1 from our present viewpoint. We find,
with $n = 9$,

$$s_{21} = \sum(\xi-\bar\xi)^2(\eta-\bar\eta) = 445.853 = n\mu_{21},$$
$$s_{12} = \sum(\xi-\bar\xi)(\eta-\bar\eta)^2 = 542.877 = n\mu_{12},$$
$$s_{31} = \sum(\xi-\bar\xi)^3(\eta-\bar\eta) = 24{,}635.041 = n\mu_{31},$$
$$s_{22} = \sum(\xi-\bar\xi)^2(\eta-\bar\eta)^2 = 46{,}677.679 = n\mu_{22}.$$

Thus (3.81) gives the observed cumulants

$$\kappa_{21} = \mu_{21} = 49.539; \qquad \kappa_{12} = \mu_{12} = 60.320;$$
$$\kappa_{31} = \mu_{31}-3\mu_{20}\mu_{11} = -1232.45;$$
$$\kappa_{22} = \mu_{22}-\mu_{20}\mu_{02}-2\mu_{11}^2 = -2493.613.$$

Using these values in equation (29.86) we find the estimate of $\alpha_1$:

$$p_1 = 1,\ p_2 = 1:\quad \alpha_1 = \frac{\kappa_{12}}{\kappa_{21}} = \frac{60.320}{49.539} = 1.22, \tag{29.88}$$

while from the second equation in (29.87), we have the much closer estimate

$$p_1 = 2,\ p_2 = 1:\quad \alpha_1 = \frac{\kappa_{22}}{\kappa_{31}} = \frac{-2493.613}{-1232.45} = 2.02. \tag{29.89}$$

It might be considered preferable to use $k$-statistics instead of cumulants in these
equations. From 13.2 we have, since we are using central moments,

$$k_{21} = \frac{n\,s_{21}}{(n-1)(n-2)}, \qquad k_{12} = \frac{n\,s_{12}}{(n-1)(n-2)},$$
$$k_{31} = \frac{n(n+1)s_{31}-3(n-1)s_{11}s_{20}}{(n-1)(n-2)(n-3)},$$
$$k_{22} = \frac{n(n+1)s_{22}-2(n-1)s_{11}^2-(n-1)s_{20}s_{02}}{(n-1)(n-2)(n-3)}.$$

The use of $k$-statistics rather than sample cumulants as estimators therefore makes no
difference to the estimate (29.88). We find

$$k_{31} = -1057.19, \qquad k_{22} = -2308.79,$$

and the estimate (29.89) is now replaced by

$$\frac{k_{22}}{k_{31}} = \frac{-2308.79}{-1057.19} = 2.18. \tag{29.90}$$

It will be remembered that these data were actually generated from random normal
deviates. It is not surprising, therefore, that the estimate (29.88) is so wide of the
mark. (The ML estimator in Example 29.1 was 1.99.) The remarks in 29.30-1
would lead us to expect this estimator to behave very wildly in the normal case, since
it is essentially estimating a ratio of zero quantities.

It will be noticed that (29.89) is slightly closer to the ML estimator than the apparently
more refined (29.90). This "refinement" is illusory, for although the $k$-statistics
are unbiassed estimators of the cumulants, we are here estimating a ratio of cumulants.
Both estimators are biassed; (29.89) is slightly simpler to compute.

The reader may like to verify that if the first equation in (29.87) is used we find
$\mu_{13} = 10{,}003$, $\kappa_{13} = -5131$ and thus the estimate $\kappa_{13}/\kappa_{22} = 2.06$, very close to (29.89).

In large samples from a normal system, none of our estimators would be at all
reliable.


29.32 We conclude that the product-cumulant method of estimating $\alpha_1$, while it
is free from additional assumptions, is vulnerable in a rather unexpected way. It
always estimates $\alpha_1$ by a ratio of cumulants, and if the denominator cumulant is zero,
or near zero, we must expect sharp fluctuation in the estimator. This is not a
phenomenon which disappears as sample size increases; indeed it may get worse.


The use of supplementary information: instrumental variables 

29.33 Suppose now that, when we observe $\xi$ and $\eta$, we also observe a further
variable $\zeta$, which is correlated with the unobservable true value $x$ but not with the
errors of observation. The observations on $\zeta$ clearly furnish us with supplementary
information about $x$ which we may turn to good account. $\zeta$ is called an instrumental
variable, because it is used merely as an instrument in the estimation of the relationship
between $y$ and $x$. We measure $\xi$, $\eta$ and $\zeta$ from their observed means.

Consider the estimator of $\alpha_1$

$$a_1 = \sum_{i=1}^{n}\zeta_i\eta_i\Big/\sum_{i=1}^{n}\zeta_i\xi_i, \tag{29.91}$$

which we write in the form

$$a_1\sum_i\zeta_i\xi_i = \sum_i\zeta_i\eta_i,$$

or, on substitution for $\eta$ and $\xi$,

$$a_1\sum_i\zeta_i(x_i+\delta_i) = \sum_i\zeta_i(\alpha_0+\alpha_1 x_i+\varepsilon_i). \tag{29.92}$$

Each of the sample covariances in (29.92) will converge in probability to its expectation.
Thus, since $\zeta$ is uncorrelated with $\delta$ and $\varepsilon$, we obtain from (29.92)

$$a_1\operatorname{cov}(\zeta, x) \to \alpha_1\operatorname{cov}(\zeta, x). \tag{29.93}$$

If and only if

$$\lim_{n\to\infty}\operatorname{cov}(\zeta, x) \neq 0, \tag{29.94}$$

(29.93) gives

$$a_1 \to \alpha_1, \tag{29.95}$$

so that $a_1$ is a consistent estimator. It will be seen that nothing has been assumed
about the instrumental variable $\zeta$ beyond its correlation with $x$ and its lack of correlation
with the errors. In particular, it may be a discrete variable. Exercise 29.17 gives
an indication of how efficient $a_1$ is. See also Exercises 29.15-16.

29.34 Whatever the form of the instrumental variable, it not only enables us to
estimate $\alpha_1$ consistently by (29.91) but also to obtain confidence regions for $(\alpha_0, \alpha_1)$,
as Durbin (1954) pointed out.

The random variable $(\eta-\alpha_0-\alpha_1\xi) = \varepsilon-\alpha_1\delta$ by (29.8). Since $\zeta$ is uncorrelated
with $\delta$ and with $\varepsilon$, it is uncorrelated with $(\eta-\alpha_0-\alpha_1\xi)$. It follows (cf. 26.23(a)) that,
given $\alpha_0$ and $\alpha_1$, the observed correlation $r$ between $\zeta$ and $(\eta-\alpha_0-\alpha_1\xi)$ is distributed
so that

$$t^2 = (n-2)r^2/(1-r^2) \tag{29.96}$$

has a "Student's" $t^2$-distribution with $(n-2)$ degrees of freedom. If we denote by
$t^2_{1-\gamma}$ the value of such a variate satisfying

$$P\{t^2 \leq t^2_{1-\gamma}\} = 1-\gamma,$$

we have, since $r^2 = t^2/\{t^2+(n-2)\}$, a monotone increasing function of $t^2$,

$$P\left\{\frac{\{\sum\zeta(\eta-\alpha_0-\alpha_1\xi)\}^2}{\sum\zeta^2\,\sum(\eta-\alpha_0-\alpha_1\xi)^2} \leq \frac{t^2_{1-\gamma}}{t^2_{1-\gamma}+(n-2)}\right\} = 1-\gamma$$

or

$$P\left\{\frac{(\sum\zeta\eta)^2-2\alpha_1\sum\zeta\eta\sum\zeta\xi+\alpha_1^2(\sum\zeta\xi)^2}{\sum\zeta^2\,(\sum\eta^2+n\alpha_0^2-2\alpha_1\sum\xi\eta+\alpha_1^2\sum\xi^2)} \leq \frac{t^2_{1-\gamma}}{t^2_{1-\gamma}+(n-2)}\right\} = 1-\gamma. \tag{29.97}$$

It will be seen that (29.97) depends only on $\alpha_0$ and $\alpha_1$ apart from the observables
$\zeta$, $\eta$, $\xi$. It defines a quadratic confidence region in the $(\alpha_0, \alpha_1)$ plane, with confidence
coefficient $1-\gamma$. If $\alpha_0$ is known, (29.97) gives a confidence interval for $\alpha_1$, but $t^2$
now has $(n-1)$ degrees of freedom, since only one parameter is now involved. We
shall see later that for particular instrumental variables, confidence intervals for $\alpha_1$ may
be obtained even when $\alpha_0$ is unknown.
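The estimator (29.91) and the statistic (29.96) are both short computations, sketched here in Python under stated assumptions: the function names `iv_slope` and `durbin_t2` are hypothetical, the intercept is expressed in the original (uncentred) units of the data, and the illustration uses the eight observations listed in Example 29.4 with a two-valued grouping variable as the instrument.

```python
def iv_slope(xi, eta, zeta):
    """Instrumental-variable estimator a_1 of (29.91), variables centred at their means."""
    n = len(xi)
    mx, my, mz = sum(xi) / n, sum(eta) / n, sum(zeta) / n
    num = sum((z - mz) * (y - my) for z, y in zip(zeta, eta))
    den = sum((z - mz) * (x - mx) for z, x in zip(zeta, xi))
    return num / den

def durbin_t2(xi, eta, zeta, alpha0, alpha1):
    """t^2 of (29.96) for the trial line eta = alpha0 + alpha1*xi (raw data units)."""
    n = len(xi)
    mz = sum(zeta) / n
    resid = [y - alpha0 - alpha1 * x for x, y in zip(xi, eta)]
    s_zz = sum((z - mz) ** 2 for z in zeta)
    s_ze = sum((z - mz) * e for z, e in zip(zeta, resid))
    r2 = s_ze ** 2 / (s_zz * sum(e * e for e in resid))
    return (n - 2) * r2 / (1.0 - r2)

xi = [1.8, 4.1, 5.8, 7.5, 10.6, 13.4, 14.7, 18.9]
eta = [6.9, 12.5, 20.0, 15.7, 23.4, 30.2, 35.6, 39.1]
zeta = [-1, -1, -1, -1, 1, 1, 1, 1]          # grouping instrument (illustrative choice)

a1 = iv_slope(xi, eta, zeta)
t2_at_estimate = durbin_t2(xi, eta, zeta, alpha0=4.625, alpha1=a1)
t2_far = durbin_t2(xi, eta, zeta, alpha0=22.9, alpha1=0.0)
t2_crit = 2.447 ** 2                          # two-sided 5 per cent point, 6 degrees of freedom
print(a1, t2_at_estimate <= t2_crit, t2_far <= t2_crit)
```

A trial pair $(\alpha_0, \alpha_1)$ belongs to the region (29.97) when its $t^2$ does not exceed the tabulated value; at $\alpha_1 = a_1$ the numerator of $r$ vanishes identically, so the estimate always lies inside the region.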


29.35 The general difficulty in using instrumental variables is the practical one
of finding a random variable known to be correlated with $x$ and known not to be correlated
with $\delta$ and with $\varepsilon$: we rarely know enough of a system to be sure that these
conditions are satisfied. However, if we use as instrumental variable a discrete "grouping"
variable (i.e. we classify the observations according to whether they fall into
certain discrete groups, and treat this classification as a discrete-valued variable) we
have more hope of satisfying the conditions. For we may know from the nature of
the situation that the observations come from several distinct groups, which materially
affect the true values of $x$; while the errors of observation have no connexion with this
classification at all. For example, referring to the pressure-volume relationship discussed
in 29.1, suppose that (29.2) were believed to hold. If we take logarithms, the
relationship becomes

$$\log P = C-\gamma\log V,$$

precisely the form we have been discussing, with $\alpha_0 = C$ and $\alpha_1 = -\gamma$. But see 29.55.

Suppose now that we knew that the determinations of volume had been made sometimes
by one method, sometimes by another; and suppose it is known that Method 1
produces a slightly different result from Method 2. The Method 1-Method 2 classification
will then be correlated with the volume determination. The errors in this
determination, and certainly those in the pressure determination (which is supposed to
be made in the same way for all observations), may be quite uncorrelated with the
Method classification. Thus we have an instrumental variable of a special kind,
essentially a grouping into two groups.

We now discuss instrumental variables of this grouping kind in some detail.


Two groups of equal size 

29.36 Suppose that $n$, the number of observations, is even, and that we divide
them into two equal groups of $\tfrac12 n = m$ observations each. (We shall discuss how the
allocation to groups is to be made in a moment.) Let $\bar\xi$ be the mean observed $\xi$ in
the first group and $\bar\xi'$ that for the second group, and similarly define $\bar\eta$ and $\bar\eta'$. Then
we may estimate $\alpha_1$ by putting an instrumental variable $\zeta$ equal to $+1$ for each observation
in the first group and $-1$ for each observation in the second group. (29.91)
then becomes

$$a_1 = \frac{\bar\eta'-\bar\eta}{\bar\xi'-\bar\xi}, \tag{29.98}$$

and using (29.98) in the model, we estimate $\alpha_0$ by

$$a_0 = \tfrac12(\bar\eta'+\bar\eta)-\tfrac12 a_1(\bar\xi'+\bar\xi). \tag{29.99}$$

Geometrically, this procedure means that in the $(\xi, \eta)$ plane we divide the points into
equal groups according to the value of $\xi$, and determine the centre of gravity of each
group. The slope of the true linear relationship is then estimated by that of the join
of these centres of gravity.

Wald (1940), to whom these estimators are due, showed that $a_1$ defined at (29.98)
is a consistent estimator of $\alpha_1$ if the grouping is independent of the errors and if the
true $x$-values satisfy

$$\liminf_{n\to\infty}|\bar x'-\bar x| > 0, \tag{29.100}$$

which is (29.94) again, since here $\operatorname{cov}(\zeta, x)$ is proportional to $\bar x'-\bar x$. (29.100) clearly will not be
satisfied if the observations are randomly allocated to the two groups, when
$\lim_{n\to\infty}|\bar x'-\bar x| = 0$. (Cf. Exercise 28.16 for the simple linear model.) Nor is it
satisfactory to allocate the $m$ smallest observed $\xi$'s to one group and the $m$ largest to the
other: Neyman and Scott (1951) show that in this case the estimator will not be
consistent. (It is easy to see that the grouping is not now, in general, independent of
the errors.) It follows that Wald's method is only of interest if we have prior information
(like that mentioned in 29.35) to validate (29.100).


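29.37 We may use the estimator (29.98) to obtain estimators of the two error
variances. For since, by (29.18),

$$\sigma_\delta^2 = \operatorname{var}\xi-\frac{\operatorname{cov}(\xi, \eta)}{\alpha_1}, \qquad \sigma_\varepsilon^2 = \operatorname{var}\eta-\alpha_1\operatorname{cov}(\xi, \eta), \tag{29.101}$$

we need only substitute the consistent estimators $s_\xi^2$, $s_\eta^2$ and $s_{\xi\eta}$ for the variances and
covariances (multiplying each by $n/(n-1)$ to remove bias), and $a_1$ for $\alpha_1$, to obtain the
estimators

$$s_\delta^2 = \frac{n}{n-1}\Big(s_\xi^2-\frac{s_{\xi\eta}}{a_1}\Big), \qquad s_\varepsilon^2 = \frac{n}{n-1}\big(s_\eta^2-a_1 s_{\xi\eta}\big). \tag{29.102}$$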

Example 29.4 
Let us apply this method purely illustratively to the data of Example 29.1. There
are 9 observations, so we omit that with the median value of $\xi$, and take our two groups
to be:

$\xi$: 1.8 4.1 5.8 7.5; 10.6 13.4 14.7 18.9
$\eta$: 6.9 12.5 20.0 15.7; 23.4 30.2 35.6 39.1.

We find

$$\bar\xi = 19.2/4 = 4.800; \qquad \bar\xi' = 57.6/4 = 14.400;$$
$$\bar\eta = 55.1/4 = 13.775; \qquad \bar\eta' = 128.3/4 = 32.075.$$

The estimate is

$$a_1 = \frac{32.075-13.775}{14.400-4.800} = 1.91,$$

reasonably close to the true value 2.

For these 8 observations, we find

$$s_\xi^2 = 29.735, \qquad s_\eta^2 = 112.709, \qquad s_{\xi\eta} = 56.764.$$

Substituting in (29.102), we find the estimates

$$s_\delta^2 = -0.054, \qquad s_\varepsilon^2 = 5.16.$$

These are very bad estimates, the true values being unity; $s_\delta^2$ is actually negative and
therefore "impossible."

Inaccurate estimates of the error variances are quite likely to appear with these
estimators, as we may easily see. If the true values $(x, y)$ are widely spaced relative to
the errors of observation, the observed values $(\xi, \eta)$ will be highly correlated, their two
regression lines will be close to each other, and $a_1$ will then be close to the regression
coefficient of $\eta$ on $\xi$, $s_{\xi\eta}/s_\xi^2$, and to the reciprocal of the regression coefficient of $\xi$ on $\eta$,
$s_{\xi\eta}/s_\eta^2$. Thus, from (29.102), both $s_\delta^2$ and $s_\varepsilon^2$ will be near zero, and quite small variations
in $a_1$ will alter them substantially. In our present example, the correlation between
$\xi$ and $\eta$ is 0.98, and even the small deviation of $a_1$ from the true value $\alpha_1$ is enough to
swing $s_\delta^2$ violently downwards and $s_\varepsilon^2$ violently upwards.
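The two-group computations can be sketched as follows. The function names are hypothetical; the intercept is taken so that the fitted line passes through the midpoint of the two centres of gravity, which is how (29.99) is read here; and the variance estimators (29.102) are evaluated from the summary values quoted in the example (with $s_\xi^2$ as computed from the eight listed observations) rather than recomputed.

```python
def two_group_line(xi, eta):
    """Wald's a_1 (29.98) and a_0 (29.99); the first half of each list is the first group."""
    m = len(xi) // 2
    mean = lambda v: sum(v) / len(v)
    x1, x2 = mean(xi[:m]), mean(xi[m:])
    y1, y2 = mean(eta[:m]), mean(eta[m:])
    a1 = (y2 - y1) / (x2 - x1)
    a0 = 0.5 * (y1 + y2) - 0.5 * a1 * (x1 + x2)
    return a1, a0

def error_variance_estimates(s_xi2, s_eta2, s_xieta, a1, n):
    """Estimators (29.102); the s's are sample moments with divisor n."""
    factor = n / (n - 1)
    return factor * (s_xi2 - s_xieta / a1), factor * (s_eta2 - a1 * s_xieta)

xi = [1.8, 4.1, 5.8, 7.5, 10.6, 13.4, 14.7, 18.9]
eta = [6.9, 12.5, 20.0, 15.7, 23.4, 30.2, 35.6, 39.1]

a1, a0 = two_group_line(xi, eta)
# Summary values as quoted in the example for these 8 observations.
s_delta2, s_eps2 = error_variance_estimates(29.735, 112.709, 56.764, a1, n=8)
print(a1, a0, s_delta2, s_eps2)
```

The first variance estimate comes out negative and the second about five times the true value, with the signs and rough magnitudes reported in the example; small changes in $a_1$ move both appreciably.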


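29.38 A confidence interval for $\alpha_1$ was also obtained by Wald (1940). For each
of the two groups, we compute sums of squares and products about its own means,
and define the pooled estimators, each therefore based on $(m-1)+(m-1) = n-2$
degrees of freedom,

$$S_\xi^2 = \frac{1}{n-2}\Big\{\sum_{i=1}^{m}(\xi_i-\bar\xi)^2+\sum_{i=1}^{m}(\xi_i'-\bar\xi')^2\Big\},$$
$$S_\eta^2 = \frac{1}{n-2}\Big\{\sum_{i=1}^{m}(\eta_i-\bar\eta)^2+\sum_{i=1}^{m}(\eta_i'-\bar\eta')^2\Big\}, \tag{29.103}$$
$$S_{\xi\eta} = \frac{1}{n-2}\Big\{\sum_{i=1}^{m}(\xi_i-\bar\xi)(\eta_i-\bar\eta)+\sum_{i=1}^{m}(\xi_i'-\bar\xi')(\eta_i'-\bar\eta')\Big\}.$$

These three quantities, in normal variation, are distributed independently of the
means $\bar\xi$, $\bar\xi'$, $\bar\eta$, $\bar\eta'$, and therefore of the estimator (29.98). In (29.101), we substitute
(29.103) to obtain the random variables, still functions of $\alpha_1$,

$$S_\delta^2 = S_\xi^2-S_{\xi\eta}/\alpha_1, \qquad S_\varepsilon^2 = S_\eta^2-\alpha_1 S_{\xi\eta}. \tag{29.104}$$

Now consider

$$S^2 = S_\varepsilon^2+\alpha_1^2 S_\delta^2 = S_\eta^2+\alpha_1^2 S_\xi^2-2\alpha_1 S_{\xi\eta}$$
$$= \frac{1}{n-2}\Big[\sum_{i=1}^{m}\{(\eta_i-\alpha_0-\alpha_1\xi_i)-(\bar\eta-\alpha_0-\alpha_1\bar\xi)\}^2+\sum_{i=1}^{m}\{(\eta_i'-\alpha_0-\alpha_1\xi_i')-(\bar\eta'-\alpha_0-\alpha_1\bar\xi')\}^2\Big]. \tag{29.105}$$

$(n-2)S^2$ is seen to be the sum of two sums of squares; each of these is itself the sum
of squares of $m$ independent normal variables $(\eta_i-\alpha_0-\alpha_1\xi_i)$ about their mean, and
from (29.8) we see that each of these has variance $\sigma_\varepsilon^2+\alpha_1^2\sigma_\delta^2$. Thus

$$\frac{(n-2)S^2}{\sigma_\varepsilon^2+\alpha_1^2\sigma_\delta^2}$$

has a $\chi^2$ distribution with $(n-2)$ degrees of freedom. We also define

$$u = \tfrac12(\bar\xi'-\bar\xi)(a_1-\alpha_1) = \tfrac12\{(\bar\eta'-\bar\eta)-\alpha_1(\bar\xi'-\bar\xi)\}$$
$$= \tfrac12\{(\bar\eta'-\alpha_0-\alpha_1\bar\xi')-(\bar\eta-\alpha_0-\alpha_1\bar\xi)\} = \tfrac12\{(\bar\varepsilon'-\alpha_1\bar\delta')-(\bar\varepsilon-\alpha_1\bar\delta)\}.$$

The two components on the extreme right, being functions of the error means in the
separate groups, are independently distributed. We thus see that $u$ is normally
distributed with zero mean and variance
$\tfrac14\cdot\tfrac{2}{m}(\sigma_\varepsilon^2+\alpha_1^2\sigma_\delta^2) = \tfrac1n(\sigma_\varepsilon^2+\alpha_1^2\sigma_\delta^2)$. Moreover, $u$ is
a function only of $\bar\xi'$, $\bar\xi$, $\bar\eta'$ and $\bar\eta$, and is therefore distributed independently of $S^2$.
Thus

$$t = \frac{u\,n^{1/2}}{S} = \frac{(\bar\xi'-\bar\xi)(a_1-\alpha_1)\,n^{1/2}}{2(S_\eta^2-2\alpha_1 S_{\xi\eta}+\alpha_1^2 S_\xi^2)^{1/2}} \tag{29.106}$$

has a "Student's" distribution with $(n-2)$ degrees of freedom. For any given
confidence coefficient $1-\gamma$, we have

$$P\{t^2 \leq t^2_{1-\gamma}\} = 1-\gamma. \tag{29.107}$$

The extreme values of $\alpha_1$ for which (29.107) is satisfied are, from (29.106), the roots of

$$(\bar\xi'-\bar\xi)^2(a_1-\alpha_1)^2 = \frac{4t^2_{1-\gamma}}{n}\,(S_\eta^2-2\alpha_1 S_{\xi\eta}+\alpha_1^2 S_\xi^2)$$

or

$$\alpha_1^2\Big\{\frac{4t^2_{1-\gamma}}{n}S_\xi^2-(\bar\xi'-\bar\xi)^2\Big\}+2\alpha_1\Big\{a_1(\bar\xi'-\bar\xi)^2-\frac{4t^2_{1-\gamma}}{n}S_{\xi\eta}\Big\}+\Big\{\frac{4t^2_{1-\gamma}}{n}S_\eta^2-a_1^2(\bar\xi'-\bar\xi)^2\Big\} = 0, \tag{29.108}$$

a quadratic equation in $\alpha_1$ of which the discriminant is

$$\Big(\frac{4t^2_{1-\gamma}}{n}\Big)^2(S_{\xi\eta}^2-S_\xi^2 S_\eta^2)+\frac{4t^2_{1-\gamma}}{n}(\bar\xi'-\bar\xi)^2(a_1^2 S_\xi^2-2a_1 S_{\xi\eta}+S_\eta^2). \tag{29.109}$$

The first term in (29.109) is negative, by Cauchy's inequality, and the second term
positive, since its factor in brackets is $\frac{1}{n-2}\sum(\eta_i-a_1\xi_i)^2$, the sums being taken about
the group means. If $n$ is large enough, the positive term, which has a multiplier
$4t^2_{1-\gamma}/n$, will be greater than the negative, with a multiplier $(4t^2_{1-\gamma}/n)^2$. Then the
quadratic (29.108) will have two real roots, which
are the confidence limits for $\alpha_1$.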
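As a numerical check on the algebra, the roots of (29.108) can be computed directly from the pooled within-group quantities (29.103). The sketch below uses a hypothetical function name, takes the tabulated "Student's" deviate as an argument, and is run on the eight observations listed in Example 29.4 with the two-sided 5 per cent deviate for 6 degrees of freedom, 2.447.

```python
import math

def wald_slope_interval(xi, eta, t_crit):
    """Roots of (29.108): confidence limits for alpha_1 from two equal groups.

    The first half of each list is taken as the first group. Returns (a1, limits),
    with limits None when the quadratic does not give a finite interval.
    """
    n = len(xi)
    m = n // 2
    mean = lambda v: sum(v) / len(v)

    def within(u, v):   # sum of products about the group means
        mu, mv = mean(u), mean(v)
        return sum((a - mu) * (b - mv) for a, b in zip(u, v))

    g1x, g2x, g1y, g2y = xi[:m], xi[m:], eta[:m], eta[m:]
    s_xx = (within(g1x, g1x) + within(g2x, g2x)) / (n - 2)
    s_yy = (within(g1y, g1y) + within(g2y, g2y)) / (n - 2)
    s_xy = (within(g1x, g1y) + within(g2x, g2y)) / (n - 2)
    d2 = (mean(g2x) - mean(g1x)) ** 2
    a1 = (mean(g2y) - mean(g1y)) / (mean(g2x) - mean(g1x))
    q = 4.0 * t_crit ** 2 / n
    # (29.108) multiplied by -1:  A*alpha^2 - 2*B*alpha + C = 0
    A = d2 - q * s_xx
    B = a1 * d2 - q * s_xy
    C = a1 ** 2 * d2 - q * s_yy
    disc = B * B - A * C
    if A <= 0.0 or disc < 0.0:
        return a1, None
    root = math.sqrt(disc)
    return a1, ((B - root) / A, (B + root) / A)

xi = [1.8, 4.1, 5.8, 7.5, 10.6, 13.4, 14.7, 18.9]
eta = [6.9, 12.5, 20.0, 15.7, 23.4, 30.2, 35.6, 39.1]
a1, limits = wald_slope_interval(xi, eta, t_crit=2.447)
print(a1, limits)
```

For these data the interval brackets both the estimate and the true slope 2. When the coefficient of $\alpha_1^2$ has the wrong sign the set of acceptable slopes is not a finite interval, which the sketch reports by returning `None`.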


29.39 Similarly, we may derive a confidence region for $(\alpha_0, \alpha_1)$. From (29.99),
we estimate $\alpha_0$ by $a_0$. Consider the variable

$$v = a_0-\alpha_0 = \tfrac12(\bar\eta'+\bar\eta)-\alpha_0-\tfrac12\alpha_1(\bar\xi'+\bar\xi). \tag{29.110}$$

$v$ is normally distributed, with zero mean and variance

$$\frac1n\operatorname{var}(\eta-\alpha_0-\alpha_1\xi) = \frac1n(\sigma_\varepsilon^2+\alpha_1^2\sigma_\delta^2),$$

i.e. its variance is the same as that of $u$ in 29.38. $v$, like $u$, is easily seen to be distributed
independently of $S^2$, so that if we substitute $v$ for $u$ in (29.106), we still have a
"Student's" $t$ variable with $(n-2)$ degrees of freedom. If $\alpha_1$ is known, we may use
this variable to set confidence intervals for $\alpha_0$, the process being simple in this case,
since $\alpha_0$ appears only in the numerator of $t$. However, this is of little practical
importance, since we rarely know $\alpha_1$ and not $\alpha_0$.

But we may also see that $u$ and $v$ are independently distributed. To establish this
we have, by the definitions of $u$ and $v$, only to show that

$$2u = (\bar\eta'-\bar\eta)-\alpha_1(\bar\xi'-\bar\xi)$$

is independent of

$$\alpha_0+v = \tfrac12\{(\bar\eta'+\bar\eta)-\alpha_1(\bar\xi'+\bar\xi)\}.$$

These two variables are normally distributed, the first of them with zero mean. Their
covariance is

$$\tfrac12\{E(\bar\eta'+\bar\eta)(\bar\eta'-\bar\eta)+\alpha_1^2 E(\bar\xi'+\bar\xi)(\bar\xi'-\bar\xi)-2\alpha_1 E(\bar\eta'\bar\xi'-\bar\eta\bar\xi)\}.$$

Each of the first two expectations is that of a difference between identically distributed
squares, and is therefore zero. The third expectation is a difference of identically
distributed products, and is also zero. Thus the covariance is zero, and these variables
are independent. Hence $u$ and $v$ are independent.

It now follows that $\dfrac{u^2+v^2}{\frac1n(\sigma_\varepsilon^2+\alpha_1^2\sigma_\delta^2)}$ is a $\chi^2$ variate with 2 degrees of freedom and
hence that

$$F = \frac{n(u^2+v^2)}{2S^2} \tag{29.111}$$

is distributed in the variance-ratio distribution with 2, $n-2$ degrees of freedom. From
this, we may obtain a confidence region for $\alpha_0$ and $\alpha_1$, which is (cf. Exercise 29.5) an
ellipse, as we should expect from the independence and normality of $u$ and $v$, which
are linear functions respectively of $\alpha_1$ and $\alpha_0$.

This confidence region is not equivalent to that obtained by putting the instrumental
variable $\zeta = \pm 1$ in (29.97). Our present region is based on the distribution of
$F = n(u^2+v^2)/(2S^2)$, but the random variable in (29.97) has a numerator depending only
on $u^2$ and is not a monotone function of $F$. Intuitively, the latter seems likely to give
a better interval, but we know of no result to this effect.


Three groups 

29.40 It was pointed out by Nair and Shrivastava (1942) and by Bartlett (1949)
that the efficiency of the grouping method may be increased by using three groups
instead of two, and estimating $\alpha_1$ by the slope of the line joining the centres of gravity
of the two extreme groups. (We have already done this implicitly in Example 29.4,
where we omitted the central observation in order to carry out a two-group analysis.)
The three-group method may be formulated as follows.

We divide the $n$ observations into three groups, the first group containing $np_1$
observations, and the third group $np_2$ observations. $p_1$ and $p_2$ are proportions, the
choice of which is to be discussed below. The two-group method is a special case with
$p_1 = p_2 = \tfrac12$ when $n$ is even (when the middle group is empty) and $p_1 = p_2 = \tfrac12(n-1)/n$
when $n$ is odd (as in Example 29.4). The grouping now corresponds to an instrumental
variable $\zeta$ in (29.91) taking values $+1$, $0$ and $-1$ for the third, second and first
groups respectively. The estimator is

$$a_1 = \frac{\bar\eta'-\bar\eta}{\bar\xi'-\bar\xi} \tag{29.112}$$

as before, but the primed symbols now refer to the third group, and the unprimed
symbols to the first group. The estimator is consistent under the same condition as
before.

Nair and Shrivastava (1942) and Bartlett (1949) studied the case $p_1 = p_2 = \tfrac13$. In
this case, as in 29.38, we define $S_\xi^2$, $S_\eta^2$ and $S_{\xi\eta}$ in (29.103) by pooling the observed
variances and covariances within the three groups, but now dividing by $(n-3)$, the
number of degrees of freedom in the present case. (29.104) then defines $S_\delta^2$ and $S_\varepsilon^2$
as before and $S^2 = S_\varepsilon^2+\alpha_1^2 S_\delta^2$. $(n-3)S^2/(\sigma_\varepsilon^2+\alpha_1^2\sigma_\delta^2)$ is a $\chi^2$ variate with $(n-3)$ degrees
of freedom. In this case

$$(\bar\xi'-\bar\xi)(a_1-\alpha_1)\Big(\frac{n}{6}\Big)^{1/2}$$

will be a normal variate distributed independently of $S^2$, with zero mean and variance
$\sigma_\varepsilon^2+\alpha_1^2\sigma_\delta^2$. Thus the analogue of (29.106) is

$$t = \frac{(\bar\xi'-\bar\xi)(a_1-\alpha_1)}{S}\Big(\frac{n}{6}\Big)^{1/2}, \tag{29.113}$$

distributed in "Student's" distribution with $(n-3)$ degrees of freedom, and we set
confidence intervals from (29.113) as before.
The results of 29.39 extend similarly to the three-group case.


29.41 The optimum choice of $p_1$ and $p_2$ has been investigated for various distributions
of $x$, assumed free from error. Bartlett's (1949) result in the rectangular case
is given as Exercise 29.11. Other results are given by Theil and van Yzeren (1956)
and by Gibson and Jowett (1957). Summarized, the results indicate that for a rather
wide range of symmetrical distributions for $x$, we should take $p_1 = p_2 = \tfrac13$ approximately,
the efficiency achieved compared with the minimum variance LS estimator being
of the order of 80 or 85 per cent.

The evidence of the relative efficiency of the two- and three-group methods in the 
presence of errors of observation is limited and indecisive. Nair and Banerjee (1942) 
found the three-group method more efficient in a sampling experiment. An example 
given by Madansky (1959) leads strongly to the opposite conclusion. 


Example 29.5
Applied to the data of Example 29.1, the method with $p_1 = p_2 = \frac{1}{3}$ gives 3 observations
in each group. We find

$$\bar{\xi}' = 15.67, \qquad \bar{\xi} = 3.90,$$
$$\bar{\eta}' = 34.97, \qquad \bar{\eta} = 13.13,$$

whence

$$a_1 = \frac{34.97 - 13.13}{15.67 - 3.90} = 1.86,$$

close to the value 1.91 obtained by the two-group method in Example 29.4, but actually
a little further from the true value, 2.

Halperin (1961b) develops a generalization of (29.108) and (29.111) which requires,
instead of a grouping method, the guessing of the $x_i$ individually to permit maximization
of the probability of obtaining a closed confidence interval or region; this is particularly
advantageous for small $n$.


The use of ranks


29.42 To conclude our discussion of grouping and instrumental variable methods,
we discuss the use of ranks. Suppose that we can arrange the individual observations
in their true order according to the observed value of one of the variables. We now
suppose, not merely that two or three groups can be so arranged, but that the values
of $x$ are so far spread out compared with error variances that the series of observed
$\xi$'s is in the same order as the series of unobserved $x$'s. We now take suffixes as
referring to the ordered observations. Again we make the usual assumptions about the
independence of the errors and consider an even number of values $2m = n$. To any
pair $\xi_i$, $\eta_i$ there is a corresponding pair $\xi_{m+i}$, $\eta_{m+i}$ and we can form an estimator of $\alpha_1$
from each of the $m$ statistics

$$a(i) = \frac{\eta_{m+i} - \eta_i}{\xi_{m+i} - \xi_i}, \qquad i = 1, 2, \ldots, m, \tag{29.114}$$

and we may choose either their mean or their median as an estimator of $\alpha_1$.
Alternatively, we could consider all possible pairs of values

$$a(i,j) = \frac{\eta_j - \eta_i}{\xi_j - \xi_i}, \qquad 1 \le i < j \le n. \tag{29.115}$$

There are $\frac{1}{2}n(n-1)$ of these values and again we could estimate $\alpha_1$ from their mean
or median.

These methods, due to Theil (1950), obviously use more information than the
grouping methods discussed earlier. The advantage of using the median rather than
the mean resides in the fact that from median estimators it is fairly easy to construct
confidence intervals, as we shall see in 29.43.
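Both sets of slopes are easily computed. The following is a minimal sketch under our own assumptions: the function names and the data are hypothetical, the observations are taken to be correctly ordered by the observed $\xi$ with no ties, and for odd $n$ the middle observation is omitted before pairing.

```python
from itertools import combinations
from statistics import mean, median

def paired_slopes(xi, eta):
    """Slopes a(i): observation i is paired with observation m + i."""
    pairs = sorted(zip(xi, eta))
    n = len(pairs)
    m = n // 2
    if n % 2:                            # odd n: drop the middle observation
        pairs = pairs[:m] + pairs[m + 1:]
    return [(pairs[m + i][1] - pairs[i][1]) / (pairs[m + i][0] - pairs[i][0])
            for i in range(m)]

def all_pair_slopes(xi, eta):
    """Slopes a(i, j) over all n(n-1)/2 pairs of observations."""
    pairs = sorted(zip(xi, eta))
    return [(y2 - y1) / (x2 - x1)
            for (x1, y1), (x2, y2) in combinations(pairs, 2)]

# Hypothetical data scattered about the line eta = 1 + 2 xi.
xi = [1, 2, 3, 4, 5, 6, 7, 8, 9]
eta = [3.1, 4.9, 7.2, 8.8, 11.0, 13.1, 14.9, 17.2, 18.8]

a_paired = paired_slopes(xi, eta)
a_all = all_pair_slopes(xi, eta)
estimate_paired = median(a_paired)       # or mean(a_paired)
estimate_all = median(a_all)             # or mean(a_all)
```

The number of all-pair slopes grows as $n^2$, which is the price paid for the extra information used.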


Example 29.6 


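Reverting once more to the data of Example 29.4 with the middle value omitted,
we find for the four values of $a(i)$ in (29.114), numbering the eight retained observations
in order of $\xi$,

$$\frac{23.4 - 6.9}{\xi_5 - \xi_1}, \qquad \frac{30.2 - 12.5}{\xi_6 - \xi_2}, \qquad \frac{35.6 - 20.0}{\xi_7 - \xi_3}, \qquad \frac{39.1 - 15.7}{\xi_8 - \xi_4}.$$

The median (half-way between the two middle values) is 1.88. The mean is 1.85.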

If we use (29.115), we can use all nine observations. There are 36 values of $a(i,j)$
which, in order of magnitude, are -2.529, -1.154, 0.708, 0.833, 0.941, 1.293, 1.342,
1.400, 1.458, 1.479, 1.544, 1.618, 1.677, 1.753, 1.797, 1.875, 1.883, 1.892, 1.903, 1.981,
2.009, 2.053, 2.179, 2.225, 2.385, 2.400, 2.429, 2.435, 2.458, 2.484, 2.764, 2.976, 3.275,
4.154, 4.412, 5.111. The median value (half-way between the 18th and 19th values)
is 1.90. The mean is 1.93.


29.43 We now relax the normality assumptions on the errors and impose a milder
condition on the term $(\varepsilon_i - \alpha_1\delta_i)$, namely, that it shall have the same continuous
distribution for all $i$. In the terminology of 28.7, we have identical errors in $\eta - \alpha_0 - \alpha_1\xi$
together with continuity. It then follows that the probability of one value, say
$\varepsilon_i - \alpha_1\delta_i$, exceeding another, $\varepsilon_{m+i} - \alpha_1\delta_{m+i}$, is $\frac{1}{2}$.
Since from (29.114)

$$a(i) = \frac{\eta_{m+i} - \eta_i}{\xi_{m+i} - \xi_i} = \alpha_1 + \frac{(\eta_{m+i} - \alpha_1\xi_{m+i}) - (\eta_i - \alpha_1\xi_i)}{\xi_{m+i} - \xi_i},$$

we have

$$a(i) - \alpha_1 = \frac{(\varepsilon_{m+i} - \alpha_1\delta_{m+i}) - (\varepsilon_i - \alpha_1\delta_i)}{\xi_{m+i} - \xi_i}.$$

The denominator $\xi_{m+i} - \xi_i$ is positive, and consequently the probability that
$a(i) - \alpha_1 > 0$ is $\frac{1}{2}$. Thus the probability that exactly $j$ of the $(a(i) - \alpha_1)$ exceed zero,
i.e. $\alpha_1 < a(i)$, is given binomially as $\binom{m}{j}\frac{1}{2^m}$, so that the probability that the $r$
greatest $a(i)$ exceed $\alpha_1$ and the $r$ smallest $a(i)$ are less than $\alpha_1$ is

$$P\{a(r) < \alpha_1 < a(m-r+1)\} = 1 - 2\sum_{j=0}^{r-1}\binom{m}{j}\frac{1}{2^m}, \tag{29.116}$$

where $a(r)$ now denotes the $r$th smallest of the $a(i)$, and which may be expressed in terms
of the Incomplete Beta Function by 5.7 if desired.
This is a confidence interval for $\alpha_1$.
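The confidence coefficient in (29.116) depends only on $m$ and $r$ and is readily tabulated. A minimal sketch follows; the function names are ours, and the slopes passed to `confidence_interval` are hypothetical values standing in for the $a(i)$.

```python
from math import comb

def confidence_coefficient(m, r):
    """1 - 2 * sum_{j=0}^{r-1} C(m, j) / 2**m, as in (29.116)."""
    tail = sum(comb(m, j) for j in range(r))
    return 1 - 2 * tail / 2 ** m

def confidence_interval(slopes, r):
    """The r-th smallest and r-th largest of the slopes a(i)."""
    ordered = sorted(slopes)
    m = len(ordered)
    return ordered[r - 1], ordered[m - r]

coeff_4_1 = confidence_coefficient(4, 1)      # m = 4 slopes, r = 1
coeff_10_2 = confidence_coefficient(10, 2)    # m = 10 slopes, r = 2
interval = confidence_interval([1.7, 2.1, 1.9, 2.4], 1)   # hypothetical a(i)
```

With only $m = 4$ slopes the widest available interval, from the smallest to the largest $a(i)$, has coefficient 0.875, so that high confidence coefficients require larger $m$.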


29.44 If, in addition, we assume that $\delta$ and $\varepsilon$ have zero medians, we have

$$P\{\eta_i - \alpha_1\xi_i < \alpha_0\} = \tfrac{1}{2}.$$

Given any $\alpha_1$, we can arrange the quantities $\eta_i - \alpha_1\xi_i$, say $z_i$, in order of magnitude,
and in the same manner we have

$$P\{z_r < \alpha_0 < z_{n-r+1}\} = 1 - 2\sum_{j=0}^{r-1}\binom{n}{j}\frac{1}{2^n}. \tag{29.117}$$

It does not appear possible by this method to give joint confidence intervals for $\alpha_0$
and $\alpha_1$ together, except with an upper bound to the confidence coefficient (cf. Exercise
29.10). Exercise 29.9 indicates a test of linearity.

The use of (29.115), when all pairs are considered, is more complicated, the distributions
no longer being binomial. They are, in fact, those required for the distribution
of a rank correlation coefficient, $t$, which we discuss in Chapter 31. Given that
distribution, confidence intervals may be set in a similar manner.


29.45 These methods may be generalized to deal with a linear relation in $k$ variables.
If we can divide the $n$ observations into $k$ groups whose order, according to
one variable, is the same for the observed as for the unobserved variable, we may find
the centre of gravity of each group and determine the hyperplane which passes through
the $k$ points. If, in addition, the order of observed and unobserved variable is the
same for every point, we may calculate $[n/k] = l$ relations for the points
$(\xi_1, \xi_{l+1}, \xi_{2l+1}, \ldots, \xi_{(k-1)l+1})$, $(\xi_2, \xi_{l+2}, \xi_{2l+2}, \ldots, \xi_{(k-1)l+2})$, etc., and average them.
Theoretically the use of (29.115) may also be generalized, but in practice it would probably
be too tedious to calculate all the $\binom{n}{k}$ possible relations. See Exercise 29.10.


29.46 A more thoroughgoing use of ranks is to use the rank values of the $\xi$'s, i.e.
the natural numbers from 1 to $n$, as an instrumental variable. This method ought to
be superior in efficiency to grouping methods, since it uses more information. Dorff
and Gurland (1961b) find it to be generally superior, for small samples, to the two- and
three-group methods, with smaller bias and mean-square-error. We illustrate the
method by an example.


Example 29.7 
For the data of Example 29.1, we use the ranks of $\xi$ from 1 to 9 as the values of
the instrumental variable $\zeta$. Since the $\xi$-values are already arranged in order, we
simply number them from 1 to 9 across the page. Then

$$\sum\zeta_i\eta_i = (1 \times 6.9) + (2 \times 12.5) + \ldots + (9 \times 39.1) = 1267.7,$$

$$\sum\zeta_i\xi_i = (1 \times 1.8) + (2 \times 4.1) + \ldots + (9 \times 18.9) = 549.0.$$

From our earlier computations,

$$\bar{\eta} = 23.14, \qquad \bar{\xi} = 9.57,$$

while $\sum\zeta_i = \frac{1}{2}n(n+1) = 45$,

so that from the observed means the covariances are

$$\sum\zeta_i\eta_i - \bar{\eta}\sum\zeta_i = 1267.7 - 23.14 \times 45 = 226.40,$$
$$\sum\zeta_i\xi_i - \bar{\xi}\sum\zeta_i = 549.0 - 9.57 \times 45 = 118.35.$$

Thus from (29.91) we have

$$a_1 = \frac{226.40}{118.35} = 1.91,$$

the same value as we obtained for the two-group method in Example 29.4, closer to
the true value 2 than the three-group method's estimate of 1.86 in Example 29.5.
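The arithmetic of this example can be checked mechanically. The sketch below first repeats the computation from the sums printed in the example, and then applies the same rank-instrument estimator to a small set of hypothetical data; the function names `ranks` and `iv_slope` are ours, and no ties among the $\xi$'s are assumed.

```python
def ranks(values):
    """Ranks 1..n of the values (assumes no ties)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def iv_slope(xi, eta, zeta):
    """Instrumental-variable slope: ratio of the two corrected cross-products."""
    n = len(xi)
    xi_bar = sum(xi) / n
    eta_bar = sum(eta) / n
    num = sum(z * e for z, e in zip(zeta, eta)) - eta_bar * sum(zeta)
    den = sum(z * x for z, x in zip(zeta, xi)) - xi_bar * sum(zeta)
    return num / den

# Check of the example, using only the sums and means printed there.
a1_printed = (1267.7 - 23.14 * 45) / (549.0 - 9.57 * 45)

# Hypothetical data scattered about the line eta = 1 + 2 xi.
xi = [1, 2, 3, 4, 5, 6, 7, 8, 9]
eta = [3.1, 4.9, 7.2, 8.8, 11.0, 13.1, 14.9, 17.2, 18.8]
a1_demo = iv_slope(xi, eta, ranks(xi))
```

In the hypothetical data the $\xi$'s are equally spaced, so the ranks are a linear function of $\xi$ and the estimator coincides with the Least Squares slope of $\eta$ on $\xi$.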


Controlled variables 

29.47 Berkson (1950) (cf. also Lindley (1953b)) has adduced an argument to show
that in certain types of experiment the estimation of a linear relation in two variables
may be reduced to a regression problem. We recall from 29.4 that the relationship

$$\eta = \alpha_0 + \alpha_1\xi + (\varepsilon - \alpha_1\delta)$$

cannot be regarded as an ordinary regression because $\xi$ is correlated with $(\varepsilon - \alpha_1\delta)$.

Suppose now that we are conducting an experiment to determine the relationship
between $y$ and $x$, in which we can adjust $\xi$ to a certain series of values, and then measure
the corresponding values of $\eta$. For example, in determining the relation between the
extension and the tension in a spring, we might hang weights ($\xi$) of 10 grams, 20 grams,
30 grams, ..., 100 grams and measure the extensions ($\eta$) which are regarded as the
result of a random error $\varepsilon$ acting on a true value $y$. However, our weights may also be
imperfect, and in attaching a nominal weight of $\xi = 50$ grams we may in fact be
attaching a weight $x$ with error $\delta = 50 - x$. Under repetitions of such experiments
with different weights, each purporting to be 50 grams, we are really applying a series
of true weights $x_i$ with errors $\delta_i = 50 - x_i$. Thus the real weights applied are the
values of a random variable $x$. $\xi$ is called a controlled variable, for its values are fixed
in advance, while the unknown true values $x$ are fluctuating.

We suppose that the errors $\delta$ have zero mean. This implies that $x$ has a mean of
$50 = \xi$. We now have

$$\xi = x_i + \delta_i,$$

where $x_i$ and $\delta_i$ are perfectly negatively correlated. If we suppose, as before, that $\delta_i$
has the same distribution for all $\xi$ we may write

$$x = \xi - \delta$$

and, as before,

$$\eta = \alpha_0 + \alpha_1 x + \varepsilon. \tag{29.118}$$

Putting the previous equation into (29.118) we find

$$\eta = \alpha_0 + \alpha_1\xi + (\varepsilon - \alpha_1\delta), \tag{29.119}$$

which is of the same form as (29.8) but is radically different. For $\xi$ is not now a random
variable, and neither $\varepsilon$ nor $\delta$ is correlated with it. Thus (29.119) is an ordinary regression
equation, to which the ordinary Least Squares methods may be applied without
modification, and $\alpha_0$ and $\alpha_1$ may be estimated and tested without difficulty.
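The contrast between the controlled-variable situation and the ordinary errors-in-variables situation can be seen in a small sampling experiment. The following is a sketch under our own assumptions, none of which come from the text: normal errors, a true line $\eta = 1 + 2x$, nominal weights of 10 to 100 grams each used 2000 times, an error standard deviation of 15 in the weights, and a fixed random seed.

```python
import random

def ls_slope(x, y):
    """Ordinary Least Squares slope of y on x."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxy = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    sxx = sum((a - xb) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)
alpha0, alpha1 = 1.0, 2.0
nominal = [10.0 * k for k in range(1, 11)] * 2000   # 10, 20, ..., 100 grams

# Controlled variable: xi is fixed, the true weight is x = xi - delta.
x_true = [xi - rng.gauss(0, 15) for xi in nominal]
eta = [alpha0 + alpha1 * x + rng.gauss(0, 1) for x in x_true]
slope_controlled = ls_slope(nominal, eta)

# Ordinary case: the true x is fixed and we observe xi = x + delta.
xi_obs = [x + rng.gauss(0, 15) for x in nominal]
eta2 = [alpha0 + alpha1 * x + rng.gauss(0, 1) for x in nominal]
slope_classical = ls_slope(xi_obs, eta2)
```

Under these assumptions `slope_controlled` lies close to the true value 2, while `slope_classical` is pulled towards zero by roughly the factor $825/(825 + 225)$, where 825 is the variance of the ten nominal weights and 225 that of the errors in them.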


29.48 Even if the values at which $\xi$ is controlled are themselves random variables
(i.e. determined by some process of random selection) the analysis above applies if
the errors $\delta$ and $\varepsilon$ are uncorrelated with $\xi$. The latter assumption is usually fulfilled,
but the former may be more difficult. In terms of our previous example, suppose that
we made a random selection of the weights available, and used these for the experiment.
The requirement that $\delta$ be uncorrelated with $\xi$ now implies, e.g. that the larger weights
should not tend to have larger or smaller errors of determination in their nominal
values than do the smaller weights. Whether this is so is a matter for empirical study.

There is no doubt that, in many experimental situations, the preceding analysis is
valid. $\xi$ is often an instrumental reading, and the experimenter often tries to hold $\xi$
to certain preassigned values (not chosen at random, but to cover a specified range
adequately). In doing so, he is well aware that the instrument is subject to error and
will not read the true values $x$ precisely. It is comforting, after our earlier discussions
of the difficulties of situations involving errors of measurement, that the standard LS
analysis may be made in this common experimental situation. This fact illustrates
the point, which cannot be too heavily stressed, that a thorough analysis of the sources
of error and of the nature of the observational process is essential to the use of correct
inferential methods, and may, as in this case, lead to a simple solution of an apparently
difficult problem.


29.49 The analysis of situations of a more complex kind, when some variables are 
controlled and some are not, or when replicated observations are obtained for certain 
values of the controlled variables, requires a careful specification of the model under 
discussion. We have not the space to go into the complications here. Reference may 
be made to T. W. Anderson (1955) and to Scheffé (1958) for some interesting work 
in this field.

Curvilinear relations

29.50 Up to this point we have considered only linear relations between the 
variables. The extension of the methods to curvilinear relationship is not as straight- 
forward a matter as it is in the theory of regression ; and indeed some of the problems 
which arise have successfully resisted attack hitherto. We proceed with an account 
of some work, mainly due to Geary (1942b, 1943, 1949, 1953), in what is, as yet, only 
a partially explored field. 

It will illustrate the kind of difficulty with which we have to contend if we consider
the quadratic functional relationship

$$Y = \alpha_0 + \alpha_1 X + \alpha_2 X^2. \tag{29.120}$$

On the same assumptions concerning errors $\varepsilon$ in $Y$ and $\delta$ in $X$ as we made in the linear
case, and with the further simplification that their variances $\sigma_\delta^2$, $\sigma_\varepsilon^2$ are in a known ratio
which we take to be 1 without loss of generality, we have, for the Likelihood Function
when the errors are normal,

$$\log L = \text{constant} - 2n\log\sigma_\delta - \frac{1}{2\sigma_\delta^2}\Bigl\{\sum(\xi_i - X_i)^2 + \sum(\eta_i - \alpha_0 - \alpha_1X_i - \alpha_2X_i^2)^2\Bigr\}. \tag{29.121}$$

Differentiation of (29.121) gives

$$\xi_i - X_i + (\eta_i - \alpha_0 - \alpha_1X_i - \alpha_2X_i^2)(\alpha_1 + 2\alpha_2X_i) = 0, \qquad i = 1, 2, \ldots, n, \tag{29.122}$$

$$\sum(\alpha_0 + \alpha_1X_i + \alpha_2X_i^2 - \eta_i) = 0, \tag{29.123}$$

$$\sum(\alpha_0 + \alpha_1X_i + \alpha_2X_i^2 - \eta_i)X_i = 0, \tag{29.124}$$

$$\sum(\alpha_0 + \alpha_1X_i + \alpha_2X_i^2 - \eta_i)X_i^2 = 0, \tag{29.125}$$

$$2n\sigma_\delta^2 = \sum(\xi_i - X_i)^2 + \sum(\eta_i - \alpha_0 - \alpha_1X_i - \alpha_2X_i^2)^2. \tag{29.126}$$

Summing (29.122) over $i$, and using (29.123) and (29.124) we find, as before at (29.42),
that if we measure the $\xi$'s from their observed mean

$$\sum X_i = \sum\xi_i = 0. \tag{29.127}$$

If we also measure the $\eta$'s from their mean, we find, from (29.123)-(29.125), and (29.127),

$$\begin{aligned}
n\alpha_0 + \alpha_2\sum X_i^2 &= 0, \\
\alpha_1\sum X_i^2 + \alpha_2\sum X_i^3 &= \sum\eta_iX_i, \\
\alpha_0\sum X_i^2 + \alpha_1\sum X_i^3 + \alpha_2\sum X_i^4 &= \sum\eta_iX_i^2.
\end{aligned} \tag{29.128}$$

(29.128) is of the pattern familiar in regression analysis; but the $X$'s here are not
observed quantities. To obtain the ML estimators we must solve the $(n+3)$ equations
in (29.122) and (29.128) for the $(n+3)$ unknowns $X_i$, $\alpha_0$, $\alpha_1$, $\alpha_2$. The estimator of
$\sigma_\delta^2$ then follows from (29.126). In practice we should probably solve these equations
by iterative methods.

The complication is also obvious from the geometrical viewpoint. We are now,
from (29.121), seeking to determine a quadratic curve such that the sum of squares of
perpendicular distances of points from it is a minimum. The joins of the different
points to the curve are not parallel and may even not be unique. A solution, though
arithmetically attainable, is clearly not expressible in concise form.
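One possible iterative scheme, which is our own sketch and not a method given in the text, alternates between the two sets of likelihood equations: for given $X_i$ the coefficients are found by ordinary Least Squares of $\eta$ on $1, X, X^2$, and for given coefficients each $X_i$ is moved by Newton's method to the foot of the perpendicular from $(\xi_i, \eta_i)$ to the curve. All function names are hypothetical, the data below lie exactly on a parabola so that the result can be checked, and convergence is not guaranteed in general.

```python
def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def ls_quadratic(X, eta):
    """Least Squares coefficients of eta on 1, X, X^2 via the normal equations."""
    S = [sum(x ** p for x in X) for p in range(5)]
    T = [sum(y * x ** p for x, y in zip(X, eta)) for p in range(3)]
    A = [[S[i + j] for j in range(3)] for i in range(3)]
    return solve_linear(A, T)

def project(xi, eta, a, steps=20):
    """Newton steps for the X minimising (xi - X)^2 + (eta - f(X))^2."""
    a0, a1, a2 = a
    X = xi
    for _ in range(steps):
        f = a0 + a1 * X + a2 * X * X
        df = a1 + 2 * a2 * X
        g = (X - xi) - (eta - f) * df            # half the derivative of the objective
        dg = 1 + df * df - 2 * a2 * (eta - f)    # derivative of g
        if dg <= 0:
            break                                # Newton step not trustworthy here
        X -= g / dg
    return X

def fit(xi, eta, sweeps=50):
    """Alternate Least Squares for the coefficients and projection for the X_i."""
    X = list(xi)
    for _ in range(sweeps):
        a = ls_quadratic(X, eta)
        X = [project(x, y, a) for x, y in zip(xi, eta)]
    return a, X

# Error-free hypothetical data on Y = 1 + 2X + 0.5X^2.
xi = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
eta = [1 + 2 * x + 0.5 * x * x for x in xi]
coeffs, X_hat = fit(xi, eta)
```

A fixed point of this alternation satisfies the likelihood equations for the $X_i$ and for the coefficients simultaneously; with real data the starting values $X_i = \xi_i$ and the number of sweeps would need to be examined.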


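29.51 The product-cumulant method of estimating the coefficients (cf. 29.28) can
be extended. Consider the cubic structural relationship

$$y = \alpha_0 + \alpha_1x + \alpha_2x^2 + \alpha_3x^3. \tag{29.129}$$

We drop the assumption that the errors are normally distributed. The joint c.f. of
$y$ and $x$ is

$$\phi(t_1, t_2) = \int\exp(\theta_1y + \theta_2x)\,dF(x, y),$$

where $\theta_j = it_j$. We then have

$$\frac{\partial\phi}{\partial\theta_1} - \alpha_0\phi - \alpha_1\frac{\partial\phi}{\partial\theta_2} - \alpha_2\frac{\partial^2\phi}{\partial\theta_2^2} - \alpha_3\frac{\partial^3\phi}{\partial\theta_2^3}
= \int(y - \alpha_0 - \alpha_1x - \alpha_2x^2 - \alpha_3x^3)\exp(\theta_1y + \theta_2x)\,dF = 0 \tag{29.130}$$

by (29.129). Putting $\psi = \log\phi$ and using the relations

$$\begin{aligned}
\frac{\partial\phi}{\partial\theta} &= \phi\frac{\partial\psi}{\partial\theta}, \\
\frac{\partial^2\phi}{\partial\theta^2} &= \phi\left\{\frac{\partial^2\psi}{\partial\theta^2} + \left(\frac{\partial\psi}{\partial\theta}\right)^2\right\}, \\
\frac{\partial^3\phi}{\partial\theta^3} &= \phi\left\{\frac{\partial^3\psi}{\partial\theta^3} + 3\frac{\partial^2\psi}{\partial\theta^2}\frac{\partial\psi}{\partial\theta} + \left(\frac{\partial\psi}{\partial\theta}\right)^3\right\},
\end{aligned} \tag{29.131}$$

we find, from (29.130)

$$\frac{\partial\psi}{\partial\theta_1} - \alpha_0 - \alpha_1\frac{\partial\psi}{\partial\theta_2} - \alpha_2\left\{\frac{\partial^2\psi}{\partial\theta_2^2} + \left(\frac{\partial\psi}{\partial\theta_2}\right)^2\right\} - \alpha_3\left\{\frac{\partial^3\psi}{\partial\theta_2^3} + 3\frac{\partial^2\psi}{\partial\theta_2^2}\frac{\partial\psi}{\partial\theta_2} + \left(\frac{\partial\psi}{\partial\theta_2}\right)^3\right\} = 0. \tag{29.132}$$

The equating to zero of coefficients in (29.132) when $\psi$ is expressed as a power series
in cumulants gives us a set of equations for the determination of the $\alpha$'s. These
equations are linear in the $\alpha$'s but not, in general, in the cumulants. By (29.81), the
product-cumulants of $x$ and $y$ are the same as those of $\xi$ and $\eta$. If $x$ and $y$ are normally
distributed, the method breaks down as in 29.30.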


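29.52 This process is not entirely straightforward. We will illustrate the point
by considering the estimation of $\alpha_0$, $\alpha_1$ and $\alpha_2$ in the quadratic case ($\alpha_3 = 0$ in (29.129)).
We have

$$\psi = \sum_{r,s}\kappa_{rs}\frac{\theta_1^r\theta_2^s}{r!\,s!}.$$

Without loss of generality we take $\kappa_{10} = \kappa_{01} = 0$. This is equivalent to taking an
origin at the mean of the $x$'s, which is estimated by the mean of the $\xi$'s. From (29.132)
we then have

$$\sum_{r,s}\kappa_{r+1,s}\frac{\theta_1^r\theta_2^s}{r!\,s!} - \alpha_0 - \alpha_1\sum_{r,s}\kappa_{r,s+1}\frac{\theta_1^r\theta_2^s}{r!\,s!}
- \alpha_2\left\{\sum_{r,s}\kappa_{r,s+2}\frac{\theta_1^r\theta_2^s}{r!\,s!} + \left(\sum_{r,s}\kappa_{r,s+1}\frac{\theta_1^r\theta_2^s}{r!\,s!}\right)^2\right\} = 0. \tag{29.133}$$

Equating coefficients to zero in (29.133), in order, we find

$$\text{Constant terms:}\quad -\alpha_0 - \alpha_2\kappa_{02} = 0. \tag{29.134}$$

This is the only equation involving $\alpha_0$. It is useless except for estimating $\alpha_0$, which
means that we must estimate not only $\alpha_2$ but $\kappa_{02}$ (which depends on the variance of
the error term). We also have

$$\begin{aligned}
\text{Terms in } \theta_1 &: \quad \kappa_{20} - \alpha_1\kappa_{11} - \alpha_2\kappa_{12} = 0, \\
\text{Terms in } \theta_2 &: \quad \kappa_{11} - \alpha_1\kappa_{02} - \alpha_2\kappa_{03} = 0, \\
\text{Terms in } \theta_1^2 &: \quad \kappa_{30} - \alpha_1\kappa_{21} - \alpha_2(\kappa_{22} + 2\kappa_{11}^2) = 0, \\
\text{Terms in } \theta_1\theta_2 &: \quad \kappa_{21} - \alpha_1\kappa_{12} - \alpha_2(\kappa_{13} + 2\kappa_{11}\kappa_{02}) = 0, \\
\text{Terms in } \theta_2^2 &: \quad \kappa_{12} - \alpha_1\kappa_{03} - \alpha_2(\kappa_{04} + 2\kappa_{02}^2) = 0.
\end{aligned} \tag{29.135}$$

The first equation in (29.135) involves $\kappa_{20}$, which does not occur again. This equation
is thus also useless for estimating the $\alpha$'s. It will be plain from (29.135) that the
coefficient of any power of $\theta_1$, say $\theta_1^r$, contains $\kappa_{r+1,0}$, which will not occur in any other
equation. Such equations are therefore useless for estimating the $\alpha$'s unless we assume
that the errors $\varepsilon$ are normally distributed (with cumulants above the second equal to
zero), in which case the cumulants $\kappa_{r0}$, as well as the product-cumulants, may be
estimated from the observables, and equations such as the third in (29.135) are usable.
We then can eliminate $\kappa_{02}$ between the second and fourth equation in (29.135). The
result, in conjunction with the third equation, enables us to solve for $\alpha_1$ and $\alpha_2$.

If we do not assume the errors to be normal, we require further equations. We
have from (29.133)

$$\begin{aligned}
\text{Terms in } \theta_1^2\theta_2 &: \quad \kappa_{31} - \alpha_1\kappa_{22} - \alpha_2(\kappa_{23} + 2\kappa_{02}\kappa_{21} + 4\kappa_{11}\kappa_{12}) = 0, \\
\text{Terms in } \theta_1\theta_2^2 &: \quad \kappa_{22} - \alpha_1\kappa_{13} - \alpha_2(\kappa_{14} + 2\kappa_{03}\kappa_{11} + 4\kappa_{02}\kappa_{12}) = 0, \\
\text{Terms in } \theta_2^3 &: \quad \kappa_{13} - \alpha_1\kappa_{04} - \alpha_2(\kappa_{05} + 6\kappa_{02}\kappa_{03}) = 0.
\end{aligned} \tag{29.136}$$

The first two equations of (29.136) contain, apart from product-cumulants, $\kappa_{02}$ and $\kappa_{03}$.
We can eliminate these with the help of the second and fourth equations in (29.135),
and solve for $\alpha_1$ and $\alpha_2$. We can thus estimate $\kappa_{02}$, and hence, from (29.134), the value
of $\alpha_0$. Some of the eliminants may be non-linear, in which case we might get more
than one set of estimators.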


29.53 The two-group estimation method of 29.36 clearly generalizes to polynomials
of order $k$ in $x$ if we can divide the observations into $(k+1)$ groups which are
in the same order by $\xi$ as by $x$; we then determine the centre of gravity of each group
and fit the polynomial to the $(k+1)$ resulting points. Theil's method (29.42) also
generalizes in a fairly obvious way. If we divide the observations into $(k+1)$ groups
and fit $[n/(k+1)]$ parabolas to the points obtained by picking one observation from
each group, we have only to average the resulting parabolas. This is not as simple
as it sounds, however. It is not necessarily true that, if we get a set of parabolas
$a_0 + a_1x + a_2x^2$, the best estimated parabola is $\bar{a}_0 + \bar{a}_1x + \bar{a}_2x^2$. Some heuristic
amalgamation of the set seems to be indicated, such as fitting by Least Squares in the direction
of the $y$-axis, or drawing the curves and selecting one which seems to represent the
median position so far as possible along its length.


29.54 The extension of the analysis of controlled variables to the non-linear case
involves one or two new points. Although the linear case can be reduced to regression
analysis, the curvilinear case cannot. Consider the cubic functional relationship

$$Y = \alpha_0 + \alpha_1X + \alpha_2X^2 + \alpha_3X^3.$$

If we put $\xi = X + \delta$, $\eta = Y + \varepsilon$ we find

$$\eta = \alpha_0 + \alpha_1(\xi - \delta) + \alpha_2(\xi - \delta)^2 + \alpha_3(\xi - \delta)^3 + \varepsilon, \tag{29.137}$$

where the $\xi$'s, the controlled quantities, are fixed. Let us consider repetitions of the
observations over the same set of $\xi$'s and denote by $E$ the expectation in such a reference
set. Summing (29.137) over the observations, we have

$$\sum\eta = n\alpha_0 + \alpha_1\sum\xi + \alpha_2\sum\xi^2 + \alpha_2\sum\delta^2 + \alpha_3\sum\xi^3 + 3\alpha_3\sum\xi\delta^2 + \text{terms of odd order in } \delta \text{ or } \varepsilon.$$

Taking expectations, we then have

$$E\Bigl(\sum\eta\Bigr) = n(\alpha_0 + \alpha_2\sigma_\delta^2) + (\alpha_1 + 3\alpha_3\sigma_\delta^2)\sum\xi + \alpha_2\sum\xi^2 + \alpha_3\sum\xi^3.$$

Likewise, multiplying (29.137) by $\xi$ and summing, we get equations

$$E\Bigl(\sum\eta\xi\Bigr) = (\alpha_0 + \alpha_2\sigma_\delta^2)\sum\xi + (\alpha_1 + 3\alpha_3\sigma_\delta^2)\sum\xi^2 + \alpha_2\sum\xi^3 + \alpha_3\sum\xi^4, \tag{29.138}$$

and so on. These equations can be solved for the quantities

$$(\alpha_0 + \alpha_2\sigma_\delta^2), \quad (\alpha_1 + 3\alpha_3\sigma_\delta^2), \quad \alpha_2, \quad \alpha_3.$$

It is rather remarkable that, although $\alpha_2$ and $\alpha_3$ are identifiable, $\alpha_0$ and $\alpha_1$ are not so
without knowledge, or an estimate, of $\sigma_\delta^2$. The only way round the difficulty seems to
be to replicate the experiment with the same set of $\xi$'s. The papers by Geary (1953)
and Scheffé (1958) should be consulted for further details.
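The combinations of parameters that appear in these expectations can be checked by expanding the powers of $(\xi - \delta)$ for a fixed $\xi$. The following is a sketch under assumptions that go slightly beyond what is stated above: we take $\delta$ to have zero mean, variance $\sigma_\delta^2$ and zero third moment, and $\varepsilon$ to have zero mean.

```latex
\begin{aligned}
E(\xi-\delta)   &= \xi, \\
E(\xi-\delta)^2 &= \xi^2 - 2\xi E(\delta) + E(\delta^2) = \xi^2 + \sigma_\delta^2, \\
E(\xi-\delta)^3 &= \xi^3 - 3\xi^2 E(\delta) + 3\xi E(\delta^2) - E(\delta^3) = \xi^3 + 3\xi\sigma_\delta^2, \\
E(\eta \mid \xi) &= (\alpha_0 + \alpha_2\sigma_\delta^2) + (\alpha_1 + 3\alpha_3\sigma_\delta^2)\,\xi + \alpha_2\xi^2 + \alpha_3\xi^3 .
\end{aligned}
```

The regression of $\eta$ on the controlled $\xi$ is thus still a cubic, but its constant and linear coefficients are shifted by multiples of $\sigma_\delta^2$, which is why only $\alpha_2$ and $\alpha_3$ are recovered directly.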


29.55 We may be able to reduce a non-linear relationship to a linear one by a 
transformation. Consider, for example, a functional relationship of the type 


y® X’ = constant. 


Here the obvious procedure is to take logarithms. In general, if we can transform
data to linearity before the estimation begins we shall have gained a great deal. The
theoretical drawback of this procedure is that if errors in $Y$ and $X$ are, say, normal
and homoscedastic, those of the transforms $\log Y$ and $\log X$ will not be. The moral,
here as elsewhere, is that we should endeavour to obtain as much prior information
as possible about the nature of the observational errors; and that, when the errors are
substantial and of unknown distribution, we should use methods of estimation which
make as few assumptions about their nature as possible.


The effect of observational errors on regression analysis 

29.56 It is convenient to conclude this chapter with a brief account of an allied
but rather different subject, the effect of errors of observation on regression analysis.
Suppose that $x$ and $y$, a pair of random variables, are affected by errors of observation
$\delta$ and $\varepsilon$, so that we observe

$$\xi = x + \delta,$$
$$\eta = y + \varepsilon.$$

As before, we suppose the $\delta$'s independent, the $\varepsilon$'s independent, and $\delta$ and $\varepsilon$ independent.
Our question now is: suppose that we determine the regression of one observed
variable on the other, say $\eta$ on $\xi$; what relation does this bear to the regression of
$y$ on $x$, which is of primary interest?

The argument of 29.28 shows us that the product-cumulants of $\xi$ and $\eta$ are those
of $x$ and $y$. Thus $\operatorname{cov}(\xi, \eta) = \operatorname{cov}(x, y)$. But regression coefficients also depend on
variances, which are not unchanged. The linear regression equation

$$y = \beta_1x, \quad\text{with}\quad \beta_1 = \operatorname{cov}(x, y)/\sigma_x^2$$

is replaced by

$$\eta = \beta_1'\xi, \quad\text{with}\quad \beta_1' = \operatorname{cov}(\xi, \eta)/\sigma_\xi^2 = \operatorname{cov}(x, y)/(\sigma_x^2 + \sigma_\delta^2);$$

and clearly $|\beta_1'| < |\beta_1|$. The effect of the errors is thus to diminish the slope of the
regression lines. It follows that the correlation between $\xi$, $\eta$ will also be weaker than
that between $x$, $y$.
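The size of the attenuation follows directly from the variances. A minimal sketch, with function names and numerical values of our own choosing; the correlation formula is the standard consequence of the covariance being unchanged while both variances are inflated by the error variances.

```python
def attenuated_slope(beta1, var_x, var_delta):
    """Slope of eta on xi: beta1 * var_x / (var_x + var_delta)."""
    return beta1 * var_x / (var_x + var_delta)

def attenuated_correlation(rho, var_x, var_delta, var_y, var_eps):
    """Correlation of (xi, eta) given that of (x, y) and the error variances."""
    shrink_x = (var_x / (var_x + var_delta)) ** 0.5
    shrink_y = (var_y / (var_y + var_eps)) ** 0.5
    return rho * shrink_x * shrink_y

# Hypothetical values: true slope 2, var(x) = 4, var(delta) = 1.
slope = attenuated_slope(2.0, 4.0, 1.0)                  # 2 * 4 / 5
corr = attenuated_correlation(0.9, 4.0, 1.0, 9.0, 0.0)   # error in x only
```

Errors in $y$ alone leave the slope of $\eta$ on $\xi$ unchanged but still weaken the correlation.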


29.57 However, this attenuation of the coefficients is not the whole story. Let us
suppose that the true regression of $y$ on $x$ is exactly linear with identical errors (cf. 28.7).
Does it follow that the true regression of $\eta$ on $\xi$ is also exactly linear? The answer is,
in general, no; only under certain quite stringent conditions will linearity be unimpaired.
We will prove a theorem stated in an elegant form by Lindley (1947): a
necessary and sufficient condition for the regression to continue to be linear is that the
c.g.f. of the variable $x$ is a multiple of the c.g.f. of the error $\delta$. More precisely, in terms
of the c.g.f.s of $x$ and $\xi$, we must have

$$\beta_1\psi_x = \beta_1'\psi_\xi. \tag{29.139}$$

We have seen in 28.7 that it is necessary and sufficient for the regression of $y$ on $x$
to be exactly linear, of the form

$$y = \beta_0 + \beta_1x + e,$$

with identical errors, that the joint f.f. of $x$ and $y$ factorizes into

$$f(x, y) = g(x)\,h(y - \beta_0 - \beta_1x), \tag{29.140}$$

where $g(x)$ is the marginal distribution of $x$ (so that $\psi_g = \psi_x$); or equivalently that the
c.g.f.s satisfy

$$\psi(t_1, t_2) = \psi_g(t_1 + t_2\beta_1) + \psi_h(t_2) + it_2\beta_0. \tag{29.141}$$

Now we know (28.5) that if $\zeta(t_1, t_2)$ is the joint c.g.f. of $\xi$, $\eta$ and the regression of
$\eta$ on $\xi$ is exactly linear of form $\eta = \beta_0' + \beta_1'\xi + e'$, then

$$\left[\frac{\partial\zeta(t_1, t_2)}{\partial t_2}\right]_{t_2=0} = i\beta_0' + \beta_1'\frac{\partial\zeta(t_1, 0)}{\partial t_1}. \tag{29.142}$$

But if $\delta$, $\varepsilon$ are independent of each other and of $x$, $y$ we have

$$\zeta(t_1, t_2) = \psi(t_1, t_2) + \psi_\delta(t_1) + \psi_\varepsilon(t_2). \tag{29.143}$$

Substituting (29.143) and (29.141) into (29.142) we find

$$\beta_1\frac{\partial\psi_g(t_1)}{\partial t_1} + \left[\frac{\partial\psi_h(t_2)}{\partial t_2}\right]_{t_2=0} + i\beta_0 + \left[\frac{\partial\psi_\varepsilon(t_2)}{\partial t_2}\right]_{t_2=0} = i\beta_0' + \beta_1'\frac{\partial\zeta(t_1, 0)}{\partial t_1}. \tag{29.144}$$

Equating coefficients of $t_1$ in (29.144) we have

$$\beta_1\frac{\partial\psi_g(t_1)}{\partial t_1} = \beta_1'\frac{\partial\zeta(t_1, 0)}{\partial t_1}. \tag{29.145}$$

Since $\zeta(t_1, 0) = \psi_\xi(t_1)$, (29.145) integrates at once to (29.139).
The other terms in (29.144) give, writing $\mu$ for means,

$$\mu_h + \beta_0 + \mu_\varepsilon = \beta_0'. \tag{29.146}$$

In particular, if $\delta$ and $\varepsilon$ have zero means,

$$\beta_0 = \beta_0'. \tag{29.147}$$

This proves the necessity of (29.139). Its sufficiency follows easily.


29.58 If we also require identical errors in the regression of $\eta$ on $\xi$, we obtain a
much stronger result (Kendall, 1951-2). For then we have (29.141) holding as well
as (29.143) for the c.g.f. $\zeta(t_1, t_2)$ of $\xi$, $\eta$. Thus, writing primes in (29.141),

$$\zeta(t_1, t_2) = \psi(t_1, t_2) + \psi_\delta(t_1) + \psi_\varepsilon(t_2) = \psi_{g'}(t_1 + t_2\beta_1') + \psi_{h'}(t_2) + it_2\beta_0'. \tag{29.148}$$

Substituting for $\psi(t_1, t_2)$ from (29.141), (29.148) becomes

$$\psi_g(t_1 + t_2\beta_1) + \psi_h(t_2) + it_2\beta_0 + \psi_\delta(t_1) + \psi_\varepsilon(t_2) = \psi_{g'}(t_1 + t_2\beta_1') + \psi_{h'}(t_2) + it_2\beta_0'. \tag{29.149}$$

Putting $t_1 = 0$ in (29.149) and subtracting the resulting equation from (29.149), we find

$$\psi_g(t_1 + t_2\beta_1) - \psi_g(t_2\beta_1) + \psi_\delta(t_1) = \psi_{g'}(t_1 + t_2\beta_1') - \psi_{g'}(t_2\beta_1'). \tag{29.150}$$

Denoting cumulants of $x$, $\xi$ by superfixes (not powers) $g$ and $g'$, we find from (29.150)

$$\sum_r\kappa_r^g\frac{\{i(t_1 + t_2\beta_1)\}^r}{r!} - \sum_r\kappa_r^g\frac{(it_2\beta_1)^r}{r!} + \sum_r\kappa_r^\delta\frac{(it_1)^r}{r!}
= \sum_r\kappa_r^{g'}\frac{\{i(t_1 + t_2\beta_1')\}^r}{r!} - \sum_r\kappa_r^{g'}\frac{(it_2\beta_1')^r}{r!}. \tag{29.151}$$

Consider a term of order $> 2$, say $r = 3$. Identifying powers of the third degree we
have

$$\begin{aligned}
\kappa_3^g + \kappa_3^\delta &= \kappa_3^{g'}, \\
\beta_1\kappa_3^g &= \beta_1'\kappa_3^{g'}, \\
\beta_1^2\kappa_3^g &= (\beta_1')^2\kappa_3^{g'}.
\end{aligned} \tag{29.152}$$

The second and third results are only possible if $\beta_1 = \beta_1'$ or if $\kappa_3^g$ and $\kappa_3^{g'}$ vanish. It
follows that all third cumulants, and similarly all cumulants higher than the second,
vanish. The converse again follows. Thus, if and only if $x$, $\delta$, and $\xi$ are all normal
will the exact linear regression with identical errors, $y = \beta_0 + \beta_1x$, become the exact
linear regression with identical errors $\eta = \beta_0' + \beta_1'\xi$.


29.59 Various other theorems on this subject have been proved. Apparently the
first was given by Allen (1938), who proved under restrictive conditions that if

$$\xi = x + \delta,$$
$$\eta = lx + \varepsilon,$$

then the necessary and sufficient condition for the regression of $\eta$ on $\xi$ to be exactly
linear for all $l$ in a closed interval is that $x$ and $\delta$ are normal. Some of her conditions
have been relaxed by Fix (1949a), who requires only that $x$, $\delta$ and $\varepsilon$ have finite means
and that either $x$ or $\delta$ has finite variance. Fix's result has been generalized further
by Laha (1956) to the case where the error variables $\delta$, $\varepsilon$ are not independent.
Lindley (1947) proved a more general result than (29.139); see Exercises 29.12, 29.14.
The result of 29.58 may also be extended to several variables; see Exercise 29.13.


EXERCISES 
29.1 Show that the ML estimator $\hat{\alpha}_1$ defined at (29.29) is a monotone function of $\lambda$.


29.2 Referring to the method of 29.28, show that: (a) if the errors $\delta$ are completely
independent of the $x$'s, but (b) if $(k-1)$ product-cumulants of the $\delta$'s can be found which
vanish, then equations (29.85) can still be used to estimate the $\alpha$'s. In particular, this is
so if the $\delta$'s are distributed in the multivariate normal form. (Geary, 1942b)


29.3 Show that equations (29.85) are equally true if there are substituted for the $\kappa$'s
the corresponding moments of the $x$'s, but that it does not then hold for the observed $\xi$'s.


29.4 Show directly that the estimator $a_1$ of (29.98) is a consistent estimator of $\alpha_1$,
and hence that the estimators (29.102) of the error variances are consistent.
(Wald, 1940)


29.5 Referring to equation (29.111), show that the confidence region for $\alpha_0$ and $\alpha_1$
consists of the interior of an ellipse. (Wald, 1940)


29.6 We have $n$ observations, $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, on a vector variate, the components of
which are distributed independently and normally with unit variances. It is desired
to test the acceptability of the relationship

$$\alpha_0 + \boldsymbol{\alpha}'\mathbf{x} = 0 \tag{A}$$

where $\alpha_0$ is scalar and $\boldsymbol{\alpha}'$ the transpose of a column vector $\boldsymbol{\alpha}$. Show that

$$n\phi = \sum_{j=1}^{n}\frac{(\alpha_0 + \boldsymbol{\alpha}'\mathbf{x}_j)^2}{\boldsymbol{\alpha}'\boldsymbol{\alpha}} \tag{B}$$

is distributed as $\chi^2$ with $n$ degrees of freedom. If $\mathbf{V}$ is the dispersion matrix of the
observations, show that the envelope of (A) subject to a constraint imposed by putting
$n\phi$ at (B) equal to a constant is given by

$$|\mathbf{V} - \phi\mathbf{I}| + \sum_{i,j}V_{ij}x_ix_j = 0,$$

where $\mathbf{I}$ is the identity matrix and $V_{ij}$ is the cofactor of the $(i, j)$th element in $|\mathbf{V} - \phi\mathbf{I}|$.
Show that this may also be written

$$1 + \mathbf{x}'(\mathbf{V} - \phi\mathbf{I})^{-1}\mathbf{x} = 0.$$
(R. L. Brown and Fereday, 1958) 


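29.7 In the previous exercise, if the roots of $|\mathbf{V} - \phi\mathbf{I}| = 0$ are $\phi_1, \ldots, \phi_k$ and $\boldsymbol{\Lambda}$
is the diagonal matrix with elements $\phi_j$; and if $\mathbf{L}$ is the orthogonal $(k \times k)$ matrix of
vectors determined by $\mathbf{V}\mathbf{L} = \mathbf{L}\boldsymbol{\Lambda}$, show that by transforming to new variables $\mathbf{y} = \mathbf{L}'\mathbf{x}$
the equation of the envelope is $1 + \mathbf{y}'(\boldsymbol{\Lambda} - \phi\mathbf{I})^{-1}\mathbf{y} = 0$ and hence is

$$1 + \sum_{j=1}^{k}\frac{y_j^2}{\phi_j - \phi} = 0.$$

(R. L. Brown and Fereday, 1958)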


29.8 In the previous two exercises, show that if, corresponding to (A) of Exercise 29.6,
we have in Exercise 29.7

$$\beta_0 + \boldsymbol{\beta}'\mathbf{y} = 0,$$

then the joint confidence region for the $\beta$'s may be written

$$\beta_0^2 + \sum_{j=1}^{k}(\phi_j - \phi_0)\beta_j^2 = 0,$$

where $\phi_0$ is the critical value of $\phi$ at (B) in Exercise 29.6, obtained from the $\chi^2$ tables.
(R. L. Brown and Fereday, 1958)


29.9 Show that the statistics $a(i)$ of equation (29.114) can be used to provide a test
of linearity of relationship (as against a convex or concave relationship), by considering
the correlation between $a(i)$ and $i$. (Theil, 1950)


29.10 A set of variables $X_j$ are connected with $Y$ by the relation

$$Y = \alpha_0 + \sum_{j=1}^{k}\alpha_jX_j.$$

The errors in the $X$'s are such that the order of the observed $\xi_j$ $(= X_j + \delta_j)$ is the same
as that of the corresponding $X_j$ for all $j = 1, 2, \ldots, k$. Show that if the $\delta$'s are such that

$$P\{z_i < z_j\} = \tfrac{1}{2},$$

where $z_i = \varepsilon_i - \sum_{j=1}^{k}\alpha_j\delta_{ji}$, then a confidence interval in the manner of 29.43 can be set
up for any $\alpha_j$ if the other $\alpha$'s are given. Hence show how to set up a conservative
confidence region for all $\alpha$'s by taking the union of the individual intervals.
(Theil, 1950)


29.11 $n\,(= 2l+1)$ observations on $\eta\,(= Y + \varepsilon)$ are made, at equally-spaced unit
intervals of $X$, which is not subject to error, so that $\xi = X$. The $\varepsilon$'s have variance $\sigma^2$.
If the parameters in $Y = \alpha_0 + \alpha_1X$ are estimated by Least Squares, giving minimum
variance unbiassed estimators, show that the estimator of $\alpha_1$ has variance

$$3\sigma^2/\{l(l+1)(2l+1)\}.$$

Show further that if the observations are divided into three groups, consisting of
$nk$, $n - 2nk$, $nk$ observations, the estimator $a_1$ of (29.112) has maximum efficiency when
$k = \frac{1}{3}$, and is then $\ge \frac{8}{9}$, while the efficiency when $k = \frac{1}{2}$ is only $\frac{27}{32}$ of this.
(Bartlett, 1949)


29.12 The regression of $y$ on $x_1, x_2, \ldots, x_k$ is given by

$$y = \sum_j\beta_jx_j,$$

and the errors are independent of the $x$'s. If $x_j$ is subject to error $\delta_j$ $(\xi_j = x_j + \delta_j)$, and
$y$ to error $\varepsilon$ $(\eta = y + \varepsilon)$, the $\delta$'s being independent of each other and of $\varepsilon$, show that the
regression of $\eta$ on the $\xi$'s is exactly linear of the form

$$\eta = \sum_j\beta_j'\xi_j$$

if and only if

$$\sum_j(\beta_j - \beta_j')\frac{\partial\psi}{\partial t_j} = \sum_j\beta_j'\frac{\partial\psi_j}{\partial t_j},$$

where the $\psi$'s are c.g.f.s, $\psi$ being that of the $x$'s and $\psi_j$ that of $\delta_j$. This generalizes 29.57.
(Lindley, 1947)


29.13 In Exercise 29.12, show further that the errors in the second regression are
independent of the $\xi$'s if and only if the distribution of the $x$'s, $\xi$'s and $\delta$'s are all normal.
This generalizes 29.58. (Kendall, 1951-2)
29.14 In Exercise 29.12 show that 
Op, 0? ere 
> By Oty Ot; 2B aan, Oty Oty 
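Hence, if $\mathbf{S}$ is the dispersion matrix of the $\xi$'s and $\boldsymbol{\Delta}$ (a diagonal matrix) that of the $\delta$'s

$$\boldsymbol{\beta} - \boldsymbol{\beta}' = \boldsymbol{\beta}\boldsymbol{\Delta}\mathbf{S}^{-1}$$

where $\boldsymbol{\beta}$, $(\boldsymbol{\beta}')$ are the row vectors of the $\beta$'s and $\beta'$'s respectively.
(Lindley, 1947)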


29.15 Show that if the unobservables x, y and the instrumental variable ¢ are normally 
distributed with zero means, and a, is defined by (29.91), the variable 


o 07, — 24; Oxy t+ 6 
2 = (n- » | {eee mee ee i} 
(41 Oz¢ — Fy¢) 
is distributed as ‘‘ Student’s ” t? with (n—1) degree of freedom. (Geary, 1949) 


29.16 From the result of the last exercise, show that approximately 
Oxt (1 Ox¢ — Fyt) 
or (a; Ox — 204 Oxy + Oy) — (a1 Ox¢ — yt)” 
= (%1 Ore — out) {a (0% 07 — O22) + (Fre Fye — Fay os) ‘ —2 
{03 (02 0% — 204 Ory + Fy) — (%1 Fee — Fy¢)? }? 
(Madansky, 1959) 


2n* vara, = 


29.17 Assuming that there are no errors of observation in $x$ or $y$ (i.e. $\xi = x$, $\eta = y$),
show that, for fixed $x$'s and $\zeta$'s, the estimating efficiency of the estimator $a_1$ of (29.91),
compared to the LS estimator, is equal to the square of the correlation between $x$ and $\zeta$.
(Durbin, 1954)


29.18 When the conditions 5? > 0}, s* 2> > Siq/(S8— 93) are not satisfied in Case 1 of 
29.12, show first that the LF is maximized either when o2 = 0 or when o? = 0, and by
comparing the maxima that the former always gives the overall maximum. Hence show
that 


eis f=t, 6 = # fe. (Birch, 1964a) 


1 = Sy 


CHAPTER 30 
TESTS OF FIT 


30.1 In our discussions of estimation and test procedures from Chapter 17 on- 
wards, we have concentrated entirely on problems concerning the parameters of 
distributions of known form. In our classification of hypothesis-testing problems 
in 22.3 we did indeed define a non-parametric hypothesis, but we have not yet investi- 
gated non-parametric hypotheses or estimation problems. In the group of four chapters 
of which this is the first, we shall be pursuing these subjects systematically. 

We shall find it convenient to defer a general discussion of non-parametric prob- 
lems, and their special features, until Chapter 31. In the present chapter, we confine 
ourselves to a particular class of procedures which stand slightly apart from the others, 
and are of sufficient practical importance to justify this special treatment. 


Tests of fit 

30.2 Let $x_1, x_2, \ldots, x_n$ be independent observations on a random variable with
distribution function F(x) which is unknown. Suppose that we wish to test the
hypothesis

$$ H_0 : F(x) = F_0(x), \tag{30.1} $$
where $F_0(x)$ is some particular d.f., which may be continuous or discrete. The prob-
lem of testing (30.1) is called a goodness-of-fit problem. Any test of (30.1) is called
a test of fit.

Hypotheses of fit, like parametric hypotheses, divide naturally into simple and
composite hypotheses. (30.1) is a simple hypothesis if $F_0(x)$ is completely specified;
e.g. the hypothesis (a) that the observations have come from a normal distribution
with specified mean and variance is a simple hypothesis. On the other hand, we may
wish to test (b) whether the observations have come from a normal distribution whose
parameters are unspecified, and this would be a composite hypothesis (in this case it
would often be called a “test of normality”). Similarly, if (c) the normal distribution
has its mean, but not its variance, specified, the hypothesis remains composite. This
is precisely the distinction we discussed in the parametric case in 22.4.


30.3 It is clear that (30.1) is no more than a restatement of the general problem of 
testing hypotheses; we have merely expressed the hypothesis in terms of the d.f.
instead of the frequency function. What is the point of this? Shall we not merely 
be retracing our previous steps ? 

The reasons for the new formulation are several. The parametric hypothesis-
testing methods developed earlier were necessarily concerned with hypotheses imposing
one or more constraints (cf. 22.4) in the parameter space; they afford no means
whatever of testing a hypothesis like (b) in 30.2, where no constraint is imposed upon
parameters and we are testing the non-parametric hypothesis that the parent d.f. is
a member of a specified (infinite) family of distributions. In such cases, and even in
cases where the hypothesis does impose one or more parametric constraints, as in (a) or
(c) of 30.2, the reformulation of the hypothesis in the form (30.1) provides us with
new methods. For we are led by intuition to expect the whole distribution of the
sample observations to mimic closely that of the true d.f. F(x). It is therefore natural
to seek to use the whole observed distribution directly as a means of testing (30.1), and
we shall find that the most important tests of fit do just this. Furthermore, the
“optimum” tests we have devised for parametric hypotheses, $H_0$, have been recommended
by the properties of their power functions against alternative hypotheses which differ
from $H_0$ only in the values of the parameters specified by $H_0$. It seems at least likely
that a test based on the whole distribution of the sample will have reasonable power
properties against a wider (infinite) class of alternatives, even though it may not be
optimum against any one of them.


The LR and Pearson tests of fit for simple $H_0$

30.4 Two well-known methods of testing goodness-of-fit depend on a very simple
device. We consider it first in the case when $F_0(x)$ is completely specified, so that
(30.1) is a simple hypothesis.

Suppose that the range of the variate x is arbitrarily divided into k mutually exclusive
classes. (These need not be, though in practice they are usually taken as, successive
intervals in the range of x.)(*) Then, since $F_0(x)$ is specified, we may calculate the
probability of an observation falling in each class. If these are denoted by $p_{0i}$,
$i = 1, 2, \ldots, k$, and the observed frequencies in the k classes by $n_i$ ($\sum_i n_i = n$), the
$n_i$ are multinomially distributed (cf. 5.30), and from the frequency function given there
we see that the LF is

$$ L(n_1, n_2, \ldots, n_k \mid p_{01}, p_{02}, \ldots, p_{0k}) \propto \prod_{i=1}^{k} p_{0i}^{n_i}. \tag{30.2} $$

On the other hand, if the true distribution function is $F_1(x)$, where $F_1$ may be any
d.f., we may denote the probabilities in the k classes by $p_{1i}$, $i = 1, 2, \ldots, k$, and the
likelihood is

$$ L(n_1, n_2, \ldots, n_k \mid p_{11}, p_{12}, \ldots, p_{1k}) \propto \prod_{i=1}^{k} p_{1i}^{n_i}. \tag{30.3} $$

We may now easily find the Likelihood Ratio test of the hypothesis (30.1), the com-
posite alternative hypothesis being

$$ H_1 : F(x) = F_1(x). $$
The likelihood (30.3) is maximized when we substitute the ML estimators for $p_{1i}$,
$$ \hat{p}_{1i} = n_i / n. $$
The LR statistic for testing $H_0$ against $H_1$ is therefore

$$ l = \frac{L(n_1, n_2, \ldots, n_k \mid p_{01}, p_{02}, \ldots, p_{0k})}{L(n_1, n_2, \ldots, n_k \mid \hat{p}_{11}, \hat{p}_{12}, \ldots, \hat{p}_{1k})} = n^n \prod_{i=1}^{k} (p_{0i}/n_i)^{n_i}. \tag{30.4} $$


$H_0$ is rejected when $l$ is small enough.


(*) We discuss the choice of k and of the classes in 30.20-3, 30.28-30 below. For the present,
we allow them to be arbitrary.

The exact distribution of (30.4) is unknown. However, we know from 24.7 that
as $n \to \infty$ when $H_0$ holds, $-2 \log l$ is asymptotically distributed in the $\chi^2$ form, with
$k-1$ degrees of freedom (since there are $r = k-1$ independent constraints on the $p_{1i}$, because
$\sum_{i=1}^{k} p_{1i} = 1$).


30.5 (30.4) is not, however, the classical test statistic put forward by Karl Pearson
(1900) for this situation. This procedure, which has been derived already as Example
15.3, uses the asymptotic k-variate normality of the multinomial distribution of the $n_i$,
and the fact that, given $H_0$, the quadratic form in the exponent of this distribution is
distributed in the $\chi^2$ form with degrees of freedom equal to its rank, $k-1$. In our
present notation, this quadratic form was found in Example 15.3 to be (*)

$$ X^2 = \sum_{i=1}^{k} \frac{(n_i - n p_{0i})^2}{n p_{0i}}. \tag{30.5} $$

M. E. Wise (1963, 1964) examines the approximations involved in using (30.5) as
a $\chi^2_{k-1}$ variable, and shows that the error is particularly small when the $n p_{0i}$ are equal
or nearly so; they need not then be large (cf. 30.22, 30.30 for composite $H_0$).

From (30.4) we have

$$ -2 \log l = 2 \sum_{i=1}^{k} n_i \log (n_i / n p_{0i}). \tag{30.6} $$

The two distinct statistics (30.5) and (30.6) thus have the same distribution asymp-
totically, given $H_0$. More than this, however, they are asymptotically equivalent statistics
when $H_0$ holds, for if we write $\Delta_i = (n_i - n p_{0i}) / (n p_{0i})$, we have

$$ -2 \log l = 2 \sum_i n_i \log (1 + \Delta_i) $$
$$ = 2 \sum_i \{ (n_i - n p_{0i}) + n p_{0i} \} \{ \Delta_i - \tfrac{1}{2} \Delta_i^2 + O(n^{-3/2}) \} $$
$$ = 2 \sum_i \{ (n_i - n p_{0i}) \Delta_i + n p_{0i} \Delta_i - \tfrac{1}{2} n p_{0i} \Delta_i^2 + O(n^{-1/2}) \} $$

and since $\sum_i p_{0i} \Delta_i = 0$, we have

$$ -2 \log l = \sum_i \{ n p_{0i} \Delta_i^2 + O(n^{-1/2}) \} = X^2 \{ 1 + O(n^{-1/2}) \}. \tag{30.7} $$

For small n, the test statistics differ. Pearson's form (30.5) may alternatively be
expressed as

$$ X^2 = \frac{1}{n} \sum_{i=1}^{k} \frac{n_i^2}{p_{0i}} - n, \tag{30.8} $$
which is easier to compute; but (30.5) has the advantage over (30.8) of being a direct
function of the differences between the observed frequencies $n_i$ and their hypothetical
expectations $n p_{0i}$, differences which are themselves of obvious interest. The corres-
ponding simplification of (30.6) is computationally inconvenient.
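
The relation between the forms (30.5), (30.8) and (30.6) can be checked numerically. The following Python sketch is illustrative only: the function names and the frequencies are hypothetical, chosen for the demonstration and not taken from the text.

```python
import math

def pearson_x2(counts, probs):
    """Pearson's X^2 in the form (30.5)."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))

def pearson_x2_short(counts, probs):
    """Computational form (30.8): (1/n) * sum(n_i^2 / p_0i) - n."""
    n = sum(counts)
    return sum(c * c / p for c, p in zip(counts, probs)) / n - n

def lr_statistic(counts, probs):
    """-2 log l in the form (30.6); empty classes contribute zero."""
    n = sum(counts)
    return 2.0 * sum(c * math.log(c / (n * p))
                     for c, p in zip(counts, probs) if c > 0)

# Hypothetical hypothesis and frequencies (illustration only).
probs = [0.2, 0.3, 0.5]
counts = [25, 27, 48]

x2 = pearson_x2(counts, probs)
x2_short = pearson_x2_short(counts, probs)
lr = lr_statistic(counts, probs)
print(x2, x2_short, lr)
```

The two forms of $X^2$ agree to rounding error, while $-2 \log l$ is close to, but not equal to, $X^2$, as (30.7) leads one to expect for moderate n.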


(*) Following recent practice, we write $X^2$ for the test statistic and reserve the symbol $\chi^2$
for the distributional form we have so frequently discussed. Earlier writers confusingly wrote
$\chi^2$ for the statistic as well as the distribution.

Choice of critical region

30.6 Since $H_0$ is rejected for small values of $l$, (30.7) implies that when using
(30.5) as test statistic, $H_0$ is to be rejected when $X^2$ is large. There has been some
uncertainty in the literature about this, the older practice being to reject $H_0$ for small
as well as large values of $X^2$, i.e. to use a two-tailed rather than an upper-tail test. For
example, Cochran (1952) approves this practice on the grounds that extremely small $X^2$
values are likely to have resulted from numerical errors in computation, while on other
occasions such values have apparently been due to the frequencies $n_i$ having been
biased, perhaps inadvertently, to bring them closer to the hypothetical expecta-
tions $n p_{0i}$.

Now there is no doubt that computations should be checked for accuracy, but
there are likely to be more direct and efficient methods of doing this than by examining
the value of $X^2$ reached. After all, we have no assurance that a moderate and accept-
able value of $X^2$ has been any more accurately computed than a very small one.
Cochran's second consideration is a more cogent one, but it is plain that in this case
we are considering a different and rarer hypothesis (that there has been voluntary or
involuntary irregularity in collecting the observations) which must be precisely formu-
lated before we can determine the best critical region to use (cf. Stuart (1954a)). Leav-
ing such irregularities aside, we use the upper tail of the distribution of $X^2$ as critical
region. This will be justified from the point of view of its asymptotic power in 30.27.


30.7 The essence of the LR and Pearson tests of fit is the reduction of the prob-
lem to one concerning the multinomial distribution. The need to group the data into
classes clearly involves the sacrifice of a certain amount of information, especially if
the underlying variable is continuous. However, this defect also carries with it a cor-
responding virtue: we do not need to know the values of the individual observations,
so long as we have k classes for which the hypothetical $p_{0i}$ can be computed. In fact,
there need be no underlying variable at all: we may use either of these tests of fit even
if the original data refer to a non-numerical classification. The point is illustrated
by Example 30.1.


Example 30.1 

In some classical experiments on pea-breeding, Mendel observed the frequencies 
of different kinds of seeds in crosses from plants with round yellow seeds and plants 
with wrinkled green seeds. They are given below, together with the theoretical prob- 
abilities on the Mendelian theory of inheritance. 


Seeds                    Observed frequency   Theoretical probability
                               n_i                   p_0i
Round and yellow               315                   9/16
Wrinkled and yellow            101                   3/16
Round and green                108                   3/16
Wrinkled and green              32                   1/16
                           -------                -------
                           n = 556                      1

(30.8) gives

$$ X^2 = \frac{16}{556} \left( \frac{315^2}{9} + \frac{101^2}{3} + \frac{108^2}{3} + \frac{32^2}{1} \right) - 556 = \frac{16}{556} \times 19\,337.3 - 556 = 0.47. $$


For $(k-1) = 3$ degrees of freedom, tables of $\chi^2$ show that the probability of a value
exceeding 0.47 lies between 0.90 and 0.95, so that the fit of the observations to the
theory is very good indeed: a test of any size $\alpha < 0.90$ would not reject the hypothesis.

For the LR statistic, (30.6) gives, after considerably more computation, $-2 \log l = 0.48$,
very close to the value for $X^2$.
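
The arithmetic of Example 30.1 can be reproduced in a few lines. This is a minimal Python sketch using the frequencies and probabilities of the table; the variable names and the helper `chi2_upper_tail_3df` are introduced here for illustration, and the closed form used for the tail probability is valid for 3 degrees of freedom only.

```python
import math

observed = [315, 101, 108, 32]               # frequencies n_i from the table
probs = [9 / 16, 3 / 16, 3 / 16, 1 / 16]     # Mendelian probabilities p_0i
n = sum(observed)

# Pearson's statistic in the computational form (30.8).
x2 = sum(o * o / p for o, p in zip(observed, probs)) / n - n

# LR statistic in the form (30.6).
lr = 2.0 * sum(o * math.log(o / (n * p)) for o, p in zip(observed, probs))

def chi2_upper_tail_3df(x):
    """P(chi-square with 3 d.f. > x); closed form for 3 d.f. only."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

p_value = chi2_upper_tail_3df(x2)
print(round(x2, 2), round(lr, 2), p_value)
```

The computed tail probability falls between 0.90 and 0.95, in agreement with the value read from tables.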


Composite $H_0$

30.8 Confining our attention now to Pearson's test statistic (30.5), we consider
the situation which arises when the hypothesis tested is composite (the LR test remains
asymptotically equivalent when $H_0$ holds; cf. Exercise 30.11). Suppose that $F_0(x)$ is
specified as to its form, but that some (or perhaps all) of the parameters are left un-
specified, as in (b) or (c) of 30.2. In the multinomial formulation of 30.4, the new
feature is that the theoretical probabilities $p_{0i}$ are not now immediately calculable,
since they are functions of the s (assumed < k-1) unspecified parameters $\theta_1, \theta_2, \ldots, \theta_s$,
which we may denote collectively by $\boldsymbol{\theta}$. Thus we must write them $p_{0i}(\boldsymbol{\theta})$. In order
to make progress, we must estimate $\boldsymbol{\theta}$ by some vector of estimators $\mathbf{t}$, and use (30.5)
in the form

$$ X^2 = \sum_{i=1}^{k} \frac{\{ n_i - n p_{0i}(\mathbf{t}) \}^2}{n p_{0i}(\mathbf{t})}. $$

This clearly changes our distribution problem, for now the $p_{0i}(\mathbf{t})$ are themselves
random variables, and it is not obvious that the asymptotic distribution of $X^2$ will be
of the same form as in the case of a simple $H_0$. In fact, the term $n_i - n p_{0i}(\mathbf{t})$ does
not necessarily have a zero expectation. We may write $X^2$ identically as

$$ X^2 = \sum_{i=1}^{k} \frac{1}{n p_{0i}(\mathbf{t})} \Big[ \{ n_i - n p_{0i}(\boldsymbol{\theta}) \}^2 + n^2 \{ p_{0i}(\mathbf{t}) - p_{0i}(\boldsymbol{\theta}) \}^2 - 2n \{ n_i - n p_{0i}(\boldsymbol{\theta}) \} \{ p_{0i}(\mathbf{t}) - p_{0i}(\boldsymbol{\theta}) \} \Big]. \tag{30.9} $$

Now we know from the theory of the multinomial distribution that asymptotically
$$ n_i - n p_{0i}(\boldsymbol{\theta}) \sim c\, n^{1/2}, $$
so that the first term in the square brackets in (30.9) is of order n. If we also have

$$ p_{0i}(\mathbf{t}) - p_{0i}(\boldsymbol{\theta}) = o(n^{-1/2}), \tag{30.10} $$
the second and third terms will be of order less than n, and relatively negligible, so that
(30.9) asymptotically behaves like its first term. Even this, however, still has the
random variable $n p_{0i}(\mathbf{t})$ as its denominator, but to the same order of approximation
we may replace this by $n p_{0i}(\boldsymbol{\theta})$. We thus see that if (30.10) holds, (30.8) behaves
asymptotically just as (30.5): it is distributed in the $\chi^2$ form with $(k-1)$ degrees of
freedom. However, if the $p_{0i}(\mathbf{t})$ are “well-behaved” functions of $\mathbf{t}$, they will differ
from the $p_{0i}(\boldsymbol{\theta})$ by the same order of magnitude as $\mathbf{t}$ does from $\boldsymbol{\theta}$. Then for all prac-
tical purposes (30.10) requires that

$$ \mathbf{t} - \boldsymbol{\theta} = o(n^{-1/2}). \tag{30.11} $$
(30.11) is not customarily satisfied, since we usually have estimators with variances
and covariances of order $n^{-1}$ and then

$$ \mathbf{t} - \boldsymbol{\theta} = O(n^{-1/2}). \tag{30.12} $$
In this “regular” case, therefore, our argument above does not hold. But it does
hold in cases where estimators have variances of order $n^{-2}$, as we have found to be
characteristic of estimators of parameters which locate the end-points of the range of
a variable (cf. Exercises 14.8, 14.13 and 32.11). In such cases, therefore, we may use
(30.8) with no new theory required. In the more common case where (30.12) holds,
we must investigate further.


30.9 It will simplify our discussion if we first give Fisher's (1922a) alternative
and revealing proof of the asymptotic distribution of (30.5) for the simple hypothesis
case.

Suppose that we have k independent Poisson variates, the ith having parameter
$n p_{0i}$. The probability that the first takes the value $n_1$, the second $n_2$, and so on, is

$$ P(n_1, n_2, \ldots, n_k) = \prod_{i=1}^{k} e^{-n p_{0i}} (n p_{0i})^{n_i} / n_i! = e^{-n}\, n^{\sum_i n_i} \prod_{i=1}^{k} p_{0i}^{n_i} / n_i!. \tag{30.13} $$

Now consider the probability of observing these values conditional upon their sum
$\sum_{i=1}^{k} n_i = n$ being constant. The sum of the k independent Poisson variables is itself (cf.
Example 11.11) a Poisson variable with parameter equal to $\sum_{i=1}^{k} n p_{0i} = n$. Thus the
probability that the sum is equal to n is

$$ P\Big( \sum_i n_i = n \Big) = e^{-n} n^n / n!. \tag{30.14} $$
We can now obtain the conditional probability we require,

$$ P\Big( n_1, n_2, \ldots, n_k \,\Big|\, \sum_i n_i = n \Big) = \frac{P(n_1, n_2, \ldots, n_k)}{P(\sum_i n_i = n)} = \frac{n!}{n_1!\, n_2! \cdots n_k!}\, p_{01}^{n_1} p_{02}^{n_2} \cdots p_{0k}^{n_k}. \tag{30.15} $$

We see at once that (30.15) is precisely the multinomial distribution of the $n_i$ on which
our test procedure is based. Thus, as an alternative to the proof of the asymptotic
distribution of $X^2$ given in Example 15.3 (cf. 30.5), we may obtain it by regarding the
$n_i$ as the values of k independent Poisson variables with parameters $n p_{0i}$, subject to
the condition $\sum_i n_i = n$. By Example 4.9, the standardized variable

$$ x_i = \frac{n_i - n p_{0i}}{(n p_{0i})^{1/2}} \tag{30.16} $$
is asymptotically normal as $n \to \infty$. Hence, as $n \to \infty$,

$$ X^2 = \sum_{i=1}^{k} x_i^2 $$
is the sum of squares of k independent normal variates, subject to the single condition
$\sum_i n_i = n$, which is equivalent to $\sum_i (n p_{0i})^{1/2} x_i = 0$. By Example 11.6, $X^2$ therefore has
a $\chi^2$ distribution asymptotically, with $(k-1)$ degrees of freedom.
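
The identity between the conditional Poisson probability and the multinomial probability (30.15) is exact for every n, and can be confirmed numerically. The following Python sketch does so for one set of values; the probabilities, the frequencies and the function names are hypothetical, chosen only for illustration.

```python
import math

def poisson_pmf(value, mean):
    """Probability that a Poisson variate with the given mean equals value."""
    return math.exp(-mean) * mean ** value / math.factorial(value)

def multinomial_pmf(counts, probs):
    """Multinomial probability, the right-hand side of (30.15)."""
    coef = math.factorial(sum(counts))
    for c in counts:
        coef //= math.factorial(c)
    return coef * math.prod(p ** c for p, c in zip(probs, counts))

probs = [0.5, 0.3, 0.2]      # hypothetical p_0i
counts = [4, 3, 1]           # hypothetical n_i, so that n = 8
n = sum(counts)

joint = math.prod(poisson_pmf(c, n * p) for c, p in zip(counts, probs))  # (30.13)
total = poisson_pmf(n, n)                                                # (30.14)
conditional = joint / total                                              # left side of (30.15)
multinomial = multinomial_pmf(counts, probs)
print(conditional, multinomial)
```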


30.10 The utility of this alternative proof is that, in conjunction with Example 11.6,
to which it refers, it shows clearly that if s further homogeneous linear conditions are
imposed on the $n_i$, the only effect on the asymptotic distribution of $X^2$ will be to reduce
the degrees of freedom from $(k-1)$ to $(k-s-1)$.

We now return to the composite hypothesis of 30.8 in the case when (30.12) holds.
Suppose that we choose as our set of estimators $\mathbf{t}$ of $\boldsymbol{\theta}$ the Maximum Likelihood (or
other asymptotically equivalent efficient) estimators, so that $\mathbf{t} = \hat{\boldsymbol{\theta}}$. Now the Likeli-
hood Function L in this case is simply the multinomial (30.15) regarded as a function
of the $\theta_j$, on which the $p_{0i}$ depend. Thus

$$ \frac{\partial \log L}{\partial \theta_j} = \sum_{i=1}^{k} \frac{n_i}{p_{0i}} \frac{\partial p_{0i}}{\partial \theta_j}, \qquad j = 1, 2, \ldots, s, \tag{30.17} $$
and the ML estimators in this regular case are the roots of the s equations obtained
by equating (30.17) to zero for each j. Clearly, each such equation is a homogeneous
linear relationship among the $n_i$. We thus see that, in this regular case, we have s
additional constraints imposed by the process of efficient estimation of $\boldsymbol{\theta}$ from the
multinomial distribution, so that the statistic (30.8) is asymptotically distributed in the
$\chi^2$ form with $(k-s-1)$ degrees of freedom. A more rigorous and detailed proof is
given by Cramér (1946); see also Birch (1964b). We shall call $\hat{\boldsymbol{\theta}}$ the multinomial ML
estimator.


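The effect of estimation on the distribution of $X^2$

30.11 We may now, following Watson (1959), consider the general problem of the
effect of estimating the unknown parameters on the asymptotic distribution of the $X^2$
statistic. We confine ourselves to the regular case, when (30.12) holds, and we write
for any estimator $\mathbf{t}$ of $\boldsymbol{\theta}$

$$ \mathbf{t} - \boldsymbol{\theta} = n^{-1/2} \mathbf{A} \mathbf{x} + o(n^{-1/2}), \tag{30.18} $$
where $\mathbf{A}$ is an arbitrary (s x k) matrix and $\mathbf{x}$ is the (k x 1) vector whose ith element is

$$ x_i = \frac{n_i - n p_{0i}(\boldsymbol{\theta})}{\{ n p_{0i}(\boldsymbol{\theta}) \}^{1/2}}, \tag{30.19} $$
defined just as at (30.16) for the simple hypothesis case; we assume $\mathbf{A}$ to have been
chosen so that
$$ E(\mathbf{A}\mathbf{x}) = \mathbf{0}. \tag{30.20} $$
It follows at once from (30.18) and (30.20) that the dispersion matrix of $\mathbf{t}$, $\mathbf{V}(\mathbf{t})$, is of
order $n^{-1}$. By a Taylor expansion applied to $\{ p_{0i}(\mathbf{t}) - p_{0i}(\boldsymbol{\theta}) \}$ in (30.9), we find that
we may write

$$ X^2 = \sum_{i=1}^{k} y_i^2, $$
where, as $n \to \infty$,

$$ y_i = x_i - n^{1/2} \sum_{j=1}^{s} (t_j - \theta_j)\, \frac{\partial p_{0i}(\boldsymbol{\theta})}{\partial \theta_j}\, \frac{1}{\{ p_{0i}(\boldsymbol{\theta}) \}^{1/2}} + o(1), $$
or, in matrix form,
$$ \mathbf{y} = \mathbf{x} - n^{1/2} \mathbf{B} (\mathbf{t} - \boldsymbol{\theta}) + o(1), \tag{30.21} $$
where $\mathbf{B}$ is the (k x s) matrix whose (i, j)th element is

$$ b_{ij} = \frac{\partial p_{0i}(\boldsymbol{\theta})}{\partial \theta_j}\, \frac{1}{\{ p_{0i}(\boldsymbol{\theta}) \}^{1/2}}. \tag{30.22} $$
Substituting (30.18) into (30.21), we find simply

$$ \mathbf{y} = (\mathbf{I} - \mathbf{B}\mathbf{A}) \mathbf{x} + o(1). \tag{30.23} $$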


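30.12 Now from equation (30.19) the $x_i$ have zero means. As $n \to \infty$, they tend
to multinormality, by the multivariate Central Limit theorem, and their dispersion
matrix is, temporarily writing $p_i$ for $p_{0i}(\boldsymbol{\theta})$,

$$ \mathbf{V}(\mathbf{x}) = \begin{pmatrix} 1 - p_1 & -(p_1 p_2)^{1/2} & -(p_1 p_3)^{1/2} & \cdots & -(p_1 p_k)^{1/2} \\ -(p_2 p_1)^{1/2} & 1 - p_2 & -(p_2 p_3)^{1/2} & \cdots & -(p_2 p_k)^{1/2} \\ \vdots & \vdots & \vdots & & \vdots \\ -(p_k p_1)^{1/2} & -(p_k p_2)^{1/2} & -(p_k p_3)^{1/2} & \cdots & 1 - p_k \end{pmatrix} = \mathbf{I} - (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})', \tag{30.24} $$
where $\mathbf{p}^{1/2}$ is the (k x 1) vector with ith element $\{ p_{0i}(\boldsymbol{\theta}) \}^{1/2}$. It follows at once from
(30.23) and (30.24) that the $y_i$ also are asymptotically normal with zero means and
dispersion matrix

$$ \mathbf{V}(\mathbf{y}) = (\mathbf{I} - \mathbf{B}\mathbf{A}) \{ \mathbf{I} - (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})' \} (\mathbf{I} - \mathbf{B}\mathbf{A})'. \tag{30.25} $$

Thus $X^2 = \mathbf{y}'\mathbf{y}$ is asymptotically distributed as the sum of squares of k normal variates
with zero means and dispersion matrix (30.25). If and only if $\mathbf{V}(\mathbf{y})$ is idempotent with
r latent roots unity and $(k-r)$ zero, so that its trace is equal to r, we know from 15.10-11
that the distribution of $X^2$ is of the $\chi^2$ form with r degrees of freedom.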


30.13 We now consider particular cases of (30.25). First, the case of a simple
hypothesis, where no estimation is necessary, is formally obtainable by putting $\mathbf{A} = \mathbf{0}$
in (30.18). (30.25) then becomes simply

$$ \mathbf{V}(\mathbf{y}) = \mathbf{V}(\mathbf{x}) = \mathbf{I} - (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})'. \tag{30.26} $$
Since $(\mathbf{p}^{1/2})' (\mathbf{p}^{1/2}) = \sum_{i=1}^{k} p_{0i}(\boldsymbol{\theta}) = 1$, (30.26) is seen on squaring it to be idempotent, and
its trace is $(k-1)$. Thus $X^2$ is a $\chi^2_{k-1}$ variable in this case, as we already know from
two different proofs.


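30.14 The composite hypothesis case is not so straightforward. First, suppose
as in 30.10 that the multinomial ML estimators $\hat{\boldsymbol{\theta}}$ are used. We seek the form of the
matrix $\mathbf{A}$ in (30.18) when $\mathbf{t} = \hat{\boldsymbol{\theta}}$. Now we know from 18.26 that the elements of the
reciprocal of the dispersion matrix of $\hat{\boldsymbol{\theta}}$ are asymptotically given by

$$ \{ \mathbf{V}(\hat{\boldsymbol{\theta}}) \}^{-1}_{jl} = -E \left\{ \frac{\partial^2 \log L}{\partial \theta_j\, \partial \theta_l} \right\}, \qquad j, l = 1, 2, \ldots, s. \tag{30.27} $$
From (30.17), the multinomial ML equations give

$$ \frac{\partial^2 \log L}{\partial \theta_j\, \partial \theta_l} = \sum_{i=1}^{k} n_i \left\{ \frac{1}{p_{0i}} \frac{\partial^2 p_{0i}}{\partial \theta_j\, \partial \theta_l} - \frac{1}{p_{0i}^2} \frac{\partial p_{0i}}{\partial \theta_j} \frac{\partial p_{0i}}{\partial \theta_l} \right\}. \tag{30.28} $$

On taking expectations in (30.28), we find

$$ -E \left\{ \frac{\partial^2 \log L}{\partial \theta_j\, \partial \theta_l} \right\} = n \left\{ \sum_{i=1}^{k} \frac{1}{p_{0i}} \frac{\partial p_{0i}}{\partial \theta_j} \frac{\partial p_{0i}}{\partial \theta_l} - \sum_{i=1}^{k} \frac{\partial^2 p_{0i}}{\partial \theta_j\, \partial \theta_l} \right\}. \tag{30.29} $$

The second term on the right of (30.29) is zero, since it is $\dfrac{\partial^2}{\partial \theta_j\, \partial \theta_l} \sum_{i=1}^{k} p_{0i}$. Thus, using
(30.22),

$$ -E \left\{ \frac{\partial^2 \log L}{\partial \theta_j\, \partial \theta_l} \right\} = n \sum_{i=1}^{k} b_{ij} b_{il}, $$
so that, from (30.27),
$$ \{ \mathbf{V}(\hat{\boldsymbol{\theta}}) \}^{-1} = n \mathbf{B}'\mathbf{B} $$
or
$$ \mathbf{C} = n \mathbf{V}(\hat{\boldsymbol{\theta}}) = (\mathbf{B}'\mathbf{B})^{-1}. \tag{30.30} $$
But from (30.18) and (30.24) we have
$$ \mathbf{D} = n \mathbf{V}(\mathbf{t}) = \mathbf{A} \mathbf{V}(\mathbf{x}) \mathbf{A}' = \mathbf{A} \{ \mathbf{I} - (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})' \} \mathbf{A}'. \tag{30.31} $$

Here (30.30) and (30.31) are alternative expressions for the same matrix.
We choose $\mathbf{A}$ to satisfy (30.30) by noting that

$$ \mathbf{B}' (\mathbf{p}^{1/2}) = \mathbf{0} \tag{30.32} $$
(since the jth element of this (s x 1) vector is $\dfrac{\partial}{\partial \theta_j} \sum_{i=1}^{k} p_{0i} = 0$), and hence that if $\mathbf{A} = \mathbf{G}\mathbf{B}'$
where $\mathbf{G}$ is symmetric and non-singular, (30.31) gives $\mathbf{G}\mathbf{B}'\mathbf{B}\mathbf{G}'$. If this is to be equal
to (30.30), we obviously have $\mathbf{G} = (\mathbf{B}'\mathbf{B})^{-1}$, so finally

$$ \mathbf{A} = (\mathbf{B}'\mathbf{B})^{-1} \mathbf{B}' \tag{30.33} $$
in the case of multinomial ML estimation. (30.25) then becomes, using (30.32),

$$ \mathbf{V}(\mathbf{y}) = \{ \mathbf{I} - \mathbf{B} (\mathbf{B}'\mathbf{B})^{-1} \mathbf{B}' \}^2 - (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})' = \mathbf{I} - \mathbf{B} (\mathbf{B}'\mathbf{B})^{-1} \mathbf{B}' - (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})'. \tag{30.34} $$
By squaring, this matrix is shown to be idempotent. Its rank is equal to its trace, which
as in 19.9 is given by
$$ \operatorname{tr} \mathbf{V}(\mathbf{y}) = \operatorname{tr} \{ \mathbf{I} - (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})' \} - \operatorname{tr} \{ \mathbf{B}' \cdot \mathbf{B} (\mathbf{B}'\mathbf{B})^{-1} \}, $$
and using 30.13 this is
$$ \operatorname{tr} \mathbf{V}(\mathbf{y}) = (k-1) - s. $$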
Thus the distribution of $X^2$ is $\chi^2_{k-s-1}$ asymptotically, as we saw in 30.10.

30.15 Our present approach enables us to inquire further: what happens to the
asymptotic distribution of $X^2$ if some other estimators than $\hat{\boldsymbol{\theta}}$ are used? This question
was first considered in the simplest case by Fisher (1928b). Chernoff and Lehmann
(1954) considered a case of particular interest, when the estimators used are the ML
estimators based on the n individual observations and not the multinomial ML esti-
mators $\hat{\boldsymbol{\theta}}$, based on the k frequencies $n_i$, which we have so far discussed. If we have
the values of the observations, it is clearly an efficient procedure to utilize this know-
ledge in estimating $\boldsymbol{\theta}$, even though we are going to use the k-class frequencies alone
in carrying out the test of fit. We shall find, however, that the $X^2$ statistic obtained in
this way no longer has an asymptotic $\chi^2$ distribution.
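
The idempotency and trace of the matrix (30.34), which give the $\chi^2_{k-s-1}$ result for multinomial ML estimation, can be checked numerically in a small case. The sketch below is an assumption-laden illustration, not part of the original argument: it takes k = 3 classes with hypothetical probabilities $\theta^2$, $2\theta(1-\theta)$, $(1-\theta)^2$ depending on a single parameter (s = 1), so that $\mathbf{B}'\mathbf{B}$ is a scalar.

```python
import math

def matmul(a, b):
    """Product of two matrices held as lists of rows."""
    return [[sum(a[i][m] * b[m][j] for m in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

theta = 0.3                                          # hypothetical parameter value
p = [theta ** 2, 2 * theta * (1 - theta), (1 - theta) ** 2]
dp = [2 * theta, 2 - 4 * theta, -2 * (1 - theta)]    # derivatives of the p_0i
k, s = len(p), 1

root_p = [math.sqrt(v) for v in p]                   # the vector p^{1/2}
b = [d / r for d, r in zip(dp, root_p)]              # single column of B, (30.22)
btb = sum(v * v for v in b)                          # B'B, a scalar since s = 1
orthogonality = sum(x * y for x, y in zip(b, root_p))  # B'p^{1/2}, (30.32)

# V(y) = I - B (B'B)^{-1} B' - p^{1/2} (p^{1/2})', as in (30.34).
v_y = [[(1.0 if i == j else 0.0) - b[i] * b[j] / btb - root_p[i] * root_p[j]
        for j in range(k)] for i in range(k)]

v_squared = matmul(v_y, v_y)
trace = sum(v_y[i][i] for i in range(k))
max_dev = max(abs(v_squared[i][j] - v_y[i][j])
              for i in range(k) for j in range(k))
print(trace, max_dev)
```

Here the trace equals k - s - 1 = 1 and the squared matrix reproduces the matrix itself to rounding error.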


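30.16 Let us return to the general expression (30.25) for the dispersion matrix.
Multiplying it out, we rewrite it

$$ \mathbf{V}(\mathbf{y}) = \{ \mathbf{I} - (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})' \} - \mathbf{B}\mathbf{A} \{ \mathbf{I} - (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})' \} - \{ \mathbf{I} - (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})' \} \mathbf{A}'\mathbf{B}' + \mathbf{B}\mathbf{A} \{ \mathbf{I} - (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})' \} \mathbf{A}'\mathbf{B}'. \tag{30.35} $$
Rather than find the latent roots $\lambda_i$ of $\mathbf{V}(\mathbf{y})$, we consider those of $\mathbf{I} - \mathbf{V}(\mathbf{y})$, which
are $1 - \lambda_i$. We write this matrix in the form

$$ \mathbf{I} - \mathbf{V}(\mathbf{y}) = (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})' + \mathbf{B} \big[ \mathbf{A} \{ \mathbf{I} - (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})' \} - \tfrac{1}{2} \mathbf{A} \{ \mathbf{I} - (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})' \} \mathbf{A}'\mathbf{B}' \big] + \big[ \{ \mathbf{I} - (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})' \} \mathbf{A}' - \tfrac{1}{2} \mathbf{B}\mathbf{A} \{ \mathbf{I} - (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})' \} \mathbf{A}' \big] \mathbf{B}' $$
$$ = (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})' + \mathbf{B} \big[ \mathbf{A} \{ \mathbf{I} - (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})' \} - \tfrac{1}{2} \mathbf{D}\mathbf{B}' \big] + \big[ \{ \mathbf{I} - (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})' \} \mathbf{A}' - \tfrac{1}{2} \mathbf{B}\mathbf{D} \big] \mathbf{B}'. \tag{30.36} $$
On substituting (30.31), (30.36) may be written as the product of two partitioned matrices,
giving

$$ \mathbf{I} - \mathbf{V}(\mathbf{y}) = \Big( \mathbf{p}^{1/2} \;\vdots\; \mathbf{B} \;\vdots\; \{ \mathbf{I} - (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})' \} \mathbf{A}' - \tfrac{1}{2} \mathbf{B}\mathbf{D} \Big) \begin{pmatrix} (\mathbf{p}^{1/2})' \\ \mathbf{A} \{ \mathbf{I} - (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})' \} - \tfrac{1}{2} \mathbf{D}\mathbf{B}' \\ \mathbf{B}' \end{pmatrix}. $$
The matrices on the right may be transposed without affecting the non-zero latent
roots. This converts their product from a (k x k) to a (2s+1) x (2s+1) matrix, which
is reduced, on using (30.30) and (30.32), to

$$ \begin{pmatrix} 1 & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{A}\mathbf{B} - \tfrac{1}{2} \mathbf{D}\mathbf{C}^{-1} & \mathbf{D} - \tfrac{1}{2} \mathbf{A}\mathbf{B}\mathbf{D} - \tfrac{1}{2} \mathbf{D}\mathbf{B}'\mathbf{A}' + \tfrac{1}{4} \mathbf{D}\mathbf{C}^{-1}\mathbf{D} \\ \mathbf{0} & \mathbf{C}^{-1} & \mathbf{B}'\mathbf{A}' - \tfrac{1}{2} \mathbf{C}^{-1}\mathbf{D} \end{pmatrix}. \tag{30.37} $$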
(30.37) has one latent root of unity and 2s others which are those of the matrix $\mathbf{M}$ par-
titioned off in its south-east corner. If k > 2s+1, which is almost invariably the case
in applications, this implies that (30.36) has $(k-2s-1)$ zero latent roots, one of unity,
and 2s others which are the roots of $\mathbf{M}$. Thus for $\mathbf{V}(\mathbf{y})$ itself, we have $(k-2s-1)$
latent roots of unity, one of zero and 2s which are the complements to unity of the
latent roots of $\mathbf{M}$.

30.17 We now consider the problem introduced in 30.15. Suppose that, to
estimate $\boldsymbol{\theta}$, we use the ML estimators based on the n individual observations, which
we shall call the “ordinary ML estimators” and denote by $\hat{\boldsymbol{\theta}}^*$. We know from 18.26
that if f is the frequency function of the observations, we have asymptotically

$$ \mathbf{D} = n \mathbf{V}(\hat{\boldsymbol{\theta}}^*) = -\left\{ E \left( \frac{\partial^2 \log f}{\partial \theta_j\, \partial \theta_l} \right) \right\}^{-1}, \tag{30.38} $$
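and that the elements of $\hat{\boldsymbol{\theta}}^*$ are the roots of

$$ \frac{\partial \log L}{\partial \theta_j} = 0, \qquad j = 1, 2, \ldots, s, $$

where L is now the ordinary (not the multinomial) Likelihood Function. Thus if $\boldsymbol{\theta}_0$
is the true value, we have the Taylor expansion

$$ 0 = \left[ \frac{\partial \log L}{\partial \theta_j} \right]_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}^*} = \left[ \frac{\partial \log L}{\partial \theta_j} \right]_{\boldsymbol{\theta} = \boldsymbol{\theta}_0} + \sum_{l=1}^{s} (\hat{\theta}^*_l - \theta_{0l}) \left[ \frac{\partial^2 \log L}{\partial \theta_j\, \partial \theta_l} \right]_{\boldsymbol{\theta} = \boldsymbol{\theta}^\dagger}, \qquad j = 1, 2, \ldots, s, $$

where $\boldsymbol{\theta}^\dagger$ lies between $\boldsymbol{\theta}_0$ and $\hat{\boldsymbol{\theta}}^*$, and as in 18.26 this gives asymptotically, using (30.38),

$$ \hat{\boldsymbol{\theta}}^* - \boldsymbol{\theta} = n^{-1} \mathbf{D} \begin{pmatrix} \partial \log L / \partial \theta_1 \\ \vdots \\ \partial \log L / \partial \theta_s \end{pmatrix}, \tag{30.39} $$

$\boldsymbol{\theta}$ being the vector of true values. Now since both sets of ML estimators are consistent,
$\partial \log L / \partial \theta_j$ will to the first order of approximation be equivalent to (30.17), which is, using
(30.22),

$$ \sum_{i=1}^{k} \frac{n_i}{(p_{0i})^{1/2}}\, b_{ij} = \sum_{i=1}^{k} \frac{n_i - n p_{0i}}{(p_{0i})^{1/2}}\, b_{ij}, $$
and we may write this in terms of (30.19) as $n^{1/2} \sum_{i=1}^{k} x_i b_{ij} = n^{1/2} (\mathbf{B}'\mathbf{x})_j$. Thus (30.39) is

$$ \hat{\boldsymbol{\theta}}^* = \boldsymbol{\theta} + n^{-1/2} \mathbf{D}\mathbf{B}'\mathbf{x}, \tag{30.40} $$
and comparison of (30.40) with (30.18) shows that here we have

$$ \mathbf{A} = \mathbf{D}\mathbf{B}'. \tag{30.41} $$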


30.18 The dispersion matrix (30.25) now becomes
$$ \mathbf{V}(\mathbf{y}) = (\mathbf{I} - \mathbf{B}\mathbf{D}\mathbf{B}') \{ \mathbf{I} - (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})' \} (\mathbf{I} - \mathbf{B}\mathbf{D}\mathbf{B}')', $$
which on using (30.31), (30.32) and (30.41) becomes
$$ \mathbf{V}(\mathbf{y}) = \mathbf{I} - (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})' - \mathbf{B}\mathbf{D}\mathbf{B}'. \tag{30.42} $$
(30.42) is not idempotent, as may be seen by squaring it. Moreover, the three matrices
on its right are all non-negative definite, and we may write
$$ \mathbf{D} + \mathbf{P} = \mathbf{C} $$
where $\mathbf{P}$ is non-negative definite, for $\mathbf{D}$ is the dispersion matrix of the fully efficient
ordinary ML estimators, while $\mathbf{C}$ is that of the multinomial ML estimators, each of
whose diagonal elements cannot be less than that of $\mathbf{D}$. Thus (30.42) may be written
$$ \mathbf{V}(\mathbf{y}) = \mathbf{I} - (\mathbf{p}^{1/2}) (\mathbf{p}^{1/2})' - \mathbf{B}\mathbf{C}\mathbf{B}' + \mathbf{B}\mathbf{P}\mathbf{B}'. \tag{30.43} $$
The first two terms on the right of (30.43) are what we got at (30.26) for the case when
no estimation takes place, when $\mathbf{V}(\mathbf{y})$ has $(k-1)$ latent roots unity, and one of zero.
The first three terms are (30.34), when, with multinomial ML estimation, $\mathbf{V}(\mathbf{y})$ has
$(k-s-1)$ latent roots unity and $(s+1)$ of zero. Because of the non-negative definiteness
of all the terms, reduction of (30.43) to canonical form shows that the latent roots of
(30.43) are bounded by the corresponding latent roots of (30.26) and (30.34). Thus
(30.43) has $(k-s-1)$ latent roots of unity, one of zero, and s between zero and unity,
as established by Chernoff and Lehmann (1954).
It follows from the fact that as k increases, the different sets of ML estimators
$\hat{\boldsymbol{\theta}}$ and $\hat{\boldsymbol{\theta}}^*$ draw closer together, so that $\mathbf{D} \to \mathbf{C}$, that the last s latent roots tend to
zero as $k \to \infty$.
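
The pattern of latent roots just described can be exhibited directly when s = 1. The following Python sketch is an illustration under stated assumptions: it uses the same hypothetical three-class model with probabilities $\theta^2$, $2\theta(1-\theta)$, $(1-\theta)^2$, and it sets the scalar D equal to 0.8 C, an arbitrary value chosen only so that D does not exceed C. For a matrix of the form (30.42) the three latent roots are then 0, 1 - D/C and 1.

```python
import math

theta = 0.3                                        # hypothetical parameter value
p = [theta ** 2, 2 * theta * (1 - theta), (1 - theta) ** 2]
dp = [2 * theta, 2 - 4 * theta, -2 * (1 - theta)]  # derivatives of the p_0i
u = [math.sqrt(v) for v in p]                      # the vector p^{1/2}
b = [d / r for d, r in zip(dp, u)]                 # single column of B, (30.22)
c = 1.0 / sum(v * v for v in b)                    # C = (B'B)^{-1}, (30.30)
d = 0.8 * c                                        # assumed D, chosen so that D <= C

k = len(p)
# V(y) = I - p^{1/2} (p^{1/2})' - B D B', the form (30.42).
v_y = [[(1.0 if i == j else 0.0) - u[i] * u[j] - d * b[i] * b[j]
        for j in range(k)] for i in range(k)]

def apply(matrix, vector):
    """Matrix-vector product for a matrix held as a list of rows."""
    return [sum(row[j] * vector[j] for j in range(len(vector))) for row in matrix]

# A vector orthogonal to both u and b (cross product, available because k = 3).
w = [u[1] * b[2] - u[2] * b[1], u[2] * b[0] - u[0] * b[2], u[0] * b[1] - u[1] * b[0]]

image_u = apply(v_y, u)       # latent root 0
image_b = apply(v_y, b)       # latent root 1 - d/c, between zero and unity
image_w = apply(v_y, w)       # latent root 1
middle_root = 1.0 - d / c
print(middle_root)
```

With k = 3 and s = 1 this gives k - s - 1 = 1 latent root of unity, one of zero and one strictly between zero and unity, and the intermediate root tends to zero as D approaches C.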


30.19 What we have found, therefore, is that $X^2$ does not have an asymptotic $\chi^2$
distribution when fully efficient (ordinary ML) estimators are used in estimating
parameters. However, the distribution of $X^2$ is bounded between a $\chi^2_{k-s-1}$ and a $\chi^2_{k-1}$
variable, and as k becomes large these are so close together that the difference can be
ignored; this is another way of expressing the final sentence in 30.18. But for k small,
the effect of using the $\chi^2_{k-s-1}$ distribution for test purposes may lead to serious error;
for the probability of exceeding any given value will be greater than we suppose. s is
rarely more than 1 or 2, but it is as well to be sure, when ordinary ML estimation is
being used, that the critical values of $\chi^2_{k-s-1}$ and $\chi^2_{k-1}$ are both exceeded by $X^2$. The
tables of $\chi^2$ show that, for a test of size $\alpha = 0.05$, the critical value for $(k-1)$ degrees
of freedom exceeds that for $(k-s-1)$ degrees of freedom, if s is small, by Cs, approxi-
mately, where C declines from about 1.5 at $(k-s-1) = 5$ to about 1.2 when
$(k-s-1) = 30$. For $\alpha = 0.01$, the corresponding values of C are about 1.7 and 1.3.


The choice of classes for the $X^2$ test


30.20 The whole of the asymptotic theory of the $X^2$ test, which we have discussed
so far, is valid however we determine the k classes into which the observations are
grouped, so long as they are determined without reference to the observations. The
italicized condition is essential, for we have made no provision for the class boundaries
themselves being random variables. However, it is common practice to determine
the class boundaries, and sometimes even to fix k itself, after reference to the general
picture presented by the observations. We must therefore discuss the formation of
classes, and then consider how far it affects the theory we have developed.

We first consider the determination of class boundaries, leaving the choice of k
until later. If there is a natural discreteness imposed by the problem (as in Example
30.1 where there are four natural groups) or if we have a sample of observations from
a discrete distribution, the class-boundary problem arises only in the sense that we
may decide (in order to reduce k, or in order to improve the accuracy of the asymptotic
distribution of $X^2$ as we shall see in 30.30 below) to amalgamate some of the hypothetical
frequencies at the discrete points. Indeed, if a discrete distribution has infinite range,
like the Poisson, we are forced to some amalgamation of hypothetical frequencies if
k is not to be infinite with most of the hypothetical frequencies very small indeed.
But the class-boundary problem arises in its most acute form only when we are sampling
from a continuous distribution. There are now no natural hypothetical frequencies
at all. If we suppose k to be determined in advance in some way, how are the boun-
daries to be determined?

In practice, arithmetical convenience is usually allowed to dictate the solution:
the classes are taken to cover equal ranges of the variate, except at an extreme where
the range of the variate is infinite. The range of a class is roughly determined by the
dispersion of the distribution, while the location of the distribution helps to determine
where the central class should fall. Thus, if we wished to form k = 10 classes for
a sample to be tested for normality, we might roughly estimate (perhaps by eye) the
mean $\bar{x}$ and the standard deviation s of the sample and take the class-boundaries as
$\bar{x} \pm \tfrac{1}{2} s j$, j = 0, 1, 2, 3, 4. The classes would then be

$(-\infty, \bar{x}-2s)$, $(\bar{x}-2s, \bar{x}-1.5s)$, $(\bar{x}-1.5s, \bar{x}-s)$, $(\bar{x}-s, \bar{x}-0.5s)$, $(\bar{x}-0.5s, \bar{x})$,
$(\bar{x}, \bar{x}+0.5s)$, $(\bar{x}+0.5s, \bar{x}+s)$, $(\bar{x}+s, \bar{x}+1.5s)$, $(\bar{x}+1.5s, \bar{x}+2s)$, $(\bar{x}+2s, \infty)$.
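
The equal-width construction just described can be sketched in Python as follows. The function names, the numerical values of the mean and standard deviation, and the convention that each class is open below and closed above are all assumptions made for this illustration; the text does not specify how an observation falling exactly on a boundary is to be classified.

```python
import math

def equal_width_boundaries(mean, sd, half_classes=5, step=0.5):
    """Boundaries mean + step*sd*j for j = -(half_classes-1), ..., half_classes-1,
    with infinite end-points added, giving 2*half_classes classes."""
    inner = [mean + step * sd * j for j in range(-(half_classes - 1), half_classes)]
    return [-math.inf] + inner + [math.inf]

def class_frequencies(sample, boundaries):
    """Observed frequencies n_i, each class taken as (lower, upper]."""
    counts = [0] * (len(boundaries) - 1)
    for x in sample:
        for i in range(len(counts)):
            if boundaries[i] < x <= boundaries[i + 1]:
                counts[i] += 1
                break
    return counts

# Hypothetical rough estimates of the mean and standard deviation.
bounds = equal_width_boundaries(mean=10.0, sd=2.0)
freqs = class_frequencies([5.0, 10.0, 10.5, 20.0], bounds)
print(bounds)
print(freqs)
```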


30.21 Although this procedure is not very precise, it clearly makes the class-
boundaries random variables, and it is not at once obvious that the asymptotic distribu-
tion of $X^2$, calculated for classes formed in this way, is the same as in the case where
the classes are fixed in advance. However, intuition suggests that since the asymptotic
theory holds for any set of k fixed classes, it should hold also when the class-boundaries
are determined from the sample. That this is indeed so when the class-boundaries
are determined by regular estimation of the unknown parameters was shown for the
case of a normal distribution by Watson (1957b) and for continuous distributions in
general by A. R. Roy (1956) and Watson (1958, 1959).

We may thus neglect the random variations of the class-boundaries so far as the
asymptotic distribution of $X^2$, when $H_0$ holds, is concerned. Small-sample distribu-
tions, of course, will be affected, but nothing is yet known of the precise effects. (We
discuss small-sample distributions of $X^2$ in the fixed-boundaries case in 30.30 below.)


The equal-probabilities method of constructing classes 

30.22 We may now directly face the question of how class-boundaries should be
determined, in the light of the assurance of the last paragraph of 30.21. If we now seek
an optimum method of boundary determination, it must be in terms of the power of
the test; we should choose that set of boundaries which maximizes power for a test
of given size. Unfortunately, there is as yet no method available for doing this,
although it is to be hoped that the recent re-awakening of interest in the theory of
$X^2$ tests will stimulate research in this field. We must therefore seek some means of
avoiding the unpleasant fact that there is a multiplicity of possible sets of classes,
any one of which will in general give a different result for the same data; we require
a rule which is plausible and practical.

One such rule has been suggested by Mann and Wald (1942) and by Gumbel
(1943): given $k$, choose the classes so that the hypothetical probabilities $p_{0i}$ are all
equal to $1/k$. This procedure is perfectly definite and unique. It varies arithmetically
from the usual method, described in 30.20 (in which the classes are variate-intervals
of equal width), in that we have to use tables to ensure that the $p_{0i}$ are equal.
This requires for exactness that the data should be available ungrouped. The procedure
is illustrated in Example 30.2.


Example 30.2 

Quenouille (1959) gives, apart from a change in location, 1000 random deviates
from the distribution

\[
dF = \exp(-x)\,dx, \qquad 0 \le x < \infty.
\]

The first 50 of these, arranged in order of variate-value, are:
0.01, 0.01, 0.04, 0.17, 0.18, 0.22, 0.22, 0.25, 0.25, 0.29, 0.42, 0.46, 0.47, 0.47, 0.56,
0.59, 0.67, 0.68, 0.70, 0.72, 0.76, 0.78, 0.83, 0.85, 0.87, 0.93, 1.00, 1.01, 1.01, 1.02,
1.03, 1.05, 1.32, 1.34, 1.37, 1.47, 1.50, 1.52, 1.54, 1.59, 1.71, 1.90, 2.10, 2.35, 2.46,
2.46, 2.50, 3.73, 4.07, 6.03.


Suppose that we wished to form four classes for a $X^2$ test. A natural grouping
with equal-width intervals would be


Variate-values      Observed     Hypothetical
                    frequency    frequency
0-0.50                 14           19.7
0.51-1.00              13           11.9
1.01-1.50              10            7.2
1.51 and over          13           11.2
                       --           ----
                       50           50.0


The hypothetical frequencies are obtained from the Biometrika Tables distribution
function of a $\chi^2$ variable with 2 degrees of freedom, which is just twice a variable with
the distribution above. We find $X^2 = 3.1$ with 3 degrees of freedom, a value which
would not reject the hypothetical parent distribution for any test of size less than
$\alpha = 0.37$; the agreement of observation and hypothesis is therefore very satisfactory.

Let us now consider how the same data would be treated by the method of 30.22.
We first determine the values of the hypothetical variable dividing it into four
equal-probability classes; these are, of course, the quartiles. The Biometrika Tables give
the values 0.288, 0.693, 1.386. We now form the table:


Variate-values      Observed     Hypothetical
                    frequency    frequency
0-0.28                  9           12.5
0.29-0.69               9           12.5
0.70-1.38              17           12.5
1.39 and over          15           12.5
                       --           ----
                       50           50.0


$X^2$ is now easier to calculate, since (30.8) reduces to

\[
X^2 = \frac{k}{n}\sum_{i=1}^{k} n_i^2 - n \tag{30.44}
\]

since all hypothetical probabilities $p_{0i} = 1/k$. We find here that $X^2 = 3.9$, which
would not lead to rejection unless the test size exceeded 0.27. The result is still very
satisfactory, but the equal-probabilities test seems rather more critical of the hypothesis
than the other test was.

It will be seen that there is little extra arithmetical work involved in the
equal-probabilities method of carrying out the $X^2$ test. Instead of a regular class-width,
with hypothetical frequencies to be looked up in a table (or, if necessary, to be
calculated) we have irregular class-widths determined from the tables so that the
hypothetical frequencies are equal.

We have had no parameters to estimate in this example. If $s$ parameters must be
estimated, we necessarily encounter the problem of 30.15-19: if ordinary ML estimators
are used, the conclusions of 30.19 apply.
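The arithmetic of Example 30.2 can be retraced with a short Python sketch. The function names are illustrative only, and the class-boundaries are computed from the exponential distribution function rather than read from tables, so the results may differ from the quoted figures in the first decimal place; in particular, direct evaluation of (30.44) from the tabulated counts gives a value near 4.1.

```python
import math

# The 50 ordered observations of Example 30.2.
DATA = [0.01, 0.01, 0.04, 0.17, 0.18, 0.22, 0.22, 0.25, 0.25, 0.29,
        0.42, 0.46, 0.47, 0.47, 0.56, 0.59, 0.67, 0.68, 0.70, 0.72,
        0.76, 0.78, 0.83, 0.85, 0.87, 0.93, 1.00, 1.01, 1.01, 1.02,
        1.03, 1.05, 1.32, 1.34, 1.37, 1.47, 1.50, 1.52, 1.54, 1.59,
        1.71, 1.90, 2.10, 2.35, 2.46, 2.46, 2.50, 3.73, 4.07, 6.03]

def exp_cdf(x):
    """Distribution function of dF = exp(-x) dx."""
    return 1.0 - math.exp(-x)

def class_counts(data, upper_bounds):
    """Counts for classes (.., b1], (b1, b2], ..., (b_last, infinity)."""
    counts = [0] * (len(upper_bounds) + 1)
    for x in data:
        i = 0
        while i < len(upper_bounds) and x > upper_bounds[i]:
            i += 1
        counts[i] += 1
    return counts

def pearson_x2(observed, expected):
    """General form (30.8): sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

n, k = len(DATA), 4

# Equal-width classes with boundaries 0.5, 1.0, 1.5.
width_bounds = [0.5, 1.0, 1.5]
width_counts = class_counts(DATA, width_bounds)
cdf_points = [0.0] + [exp_cdf(b) for b in width_bounds] + [1.0]
width_expected = [n * (hi - lo) for lo, hi in zip(cdf_points, cdf_points[1:])]
x2_width = pearson_x2(width_counts, width_expected)

# Equal-probability classes: boundaries are the quartiles -log(1 - j/4).
quartiles = [-math.log(1.0 - j / k) for j in range(1, k)]
equal_counts = class_counts(DATA, quartiles)
x2_equal = k / n * sum(c * c for c in equal_counts) - n   # shortcut (30.44)
```

The shortcut (30.44) and the general form (30.8) agree exactly when every expected frequency equals $n/k$, which is why only the sum of squared counts is needed in the equal-probabilities case.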


30.23 Apart from the virtue of removing the class-boundary decision from uncertainty,
the equal-probabilities method of forming classes for the $X^2$ test will not
necessarily increase the power of the test, for one would suspect that a "goodness-of-fit"
hypothesis is likely to be most vulnerable at the extremes of the range of the variable,
and the equal-probabilities method may well result in a loss of sensitivity at the
extremes unless $k$ is rather large. This brings us to the question of how $k$ should
be chosen, and in order to discuss this question we must consider the power of the
$X^2$ test. First, we investigate the moments of the $X^2$ statistic.


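The moments of the $X^2$ test statistic
30.24 We suppose, as before, that we have hypothetical probabilities $p_{0i}$ when
$H_0$ holds, so that our test statistic is, as at (30.8),

\[
X^2 = \sum_{i=1}^{k} \frac{(n_i - n p_{0i})^2}{n p_{0i}} = \frac{1}{n}\sum_{i=1}^{k}\frac{n_i^2}{p_{0i}} - n.
\]

We confine ourselves to the simple hypothesis. Suppose now that the true probabilities
are $p_{1i}$, $i = 1, 2, \ldots, k$. The expected value of the test statistic is then

\[
E(X^2) = \frac{1}{n}\sum_{i=1}^{k}\frac{E(n_i^2)}{p_{0i}} - n.
\]

From the moments of the multinomial distribution at (5.80),

\[
E(n_i^2) = n p_{1i}(1 - p_{1i}) + n^2 p_{1i}^2, \tag{30.45}
\]

whence

\[
E(X^2) = \sum_{i=1}^{k}\frac{p_{1i}(1 - p_{1i})}{p_{0i}} + n\left\{\sum_{i=1}^{k}\frac{p_{1i}^2}{p_{0i}} - 1\right\}. \tag{30.46}
\]

When $H_0$ holds, this reduces to

\[
E(X^2 \mid H_0) = k - 1. \tag{30.47}
\]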
This exact result is already known to hold asymptotically, since $X^2$ is then a $\chi^2_{k-1}$
variate. If we differentiate (30.46) with respect to the $p_{1i}$, subject to $\sum_i p_{1i} = 1$, we find
that, as $n \to \infty$, (30.46) has its minimum value when $p_{1i} = p_{0i}$. For any hypothesis
$H_1$ specifying a set of probabilities $p_{1i} \ne p_{0i}$, we therefore have asymptotically

\[
E(X^2 \mid H_1) > k - 1. \tag{30.48}
\]

(30.48), like the asymptotic argument based on the LR statistic in 30.6, indicates that
the critical region for the $X^2$ test consists of the upper tail, although this indication is
not conclusive since the asymptotic distribution of $X^2$ is not of the $\chi^2$ form when $H_1$
holds. This alternative distribution is, in fact, a non-central $\chi^2$ under the conditions
given in 30.27 below.
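Even the variance of $X^2$ is a relatively complicated function of the $p_{0i}$ and $p_{1i}$ (cf.
Exercise 30.5). However, in the equal-probabilities case ($p_{0i} = 1/k$) the asymptotic
variance simplifies considerably and we find (the proof is left to the reader as Exercise
30.3)

\[
\begin{aligned}
\operatorname{var}(X^2 \mid H_0) &\sim 2(k-1),\\
\operatorname{var}(X^2 \mid H_1) &\sim 4(n-1)k^2\Bigl\{\sum_i p_{1i}^3 - \Bigl(\sum_i p_{1i}^2\Bigr)^2\Bigr\}.
\end{aligned} \tag{30.49}
\]

From (30.46) we also have in the equal-probabilities case

\[
E(X^2 \mid H_1) = (k-1) + (n-1)\Bigl(k\sum_i p_{1i}^2 - 1\Bigr). \tag{30.50}
\]

(30.50) is always greater than $(k-1)$ for any $n > 1$, since $k\sum_i p_{1i}^2 \ge 1$ with equality
only when every $p_{1i} = 1/k$.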
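As a numerical check on these expectations, the following Python sketch evaluates (30.46) and its equal-probabilities form (30.50) directly. The function names and the alternative probabilities are illustrative assumptions, not part of the text.

```python
def expected_x2(n, p0, p1):
    """E(X^2) from (30.46) for hypothetical probabilities p0 and true p1."""
    first = sum(b * (1.0 - b) / a for a, b in zip(p0, p1))
    second = n * (sum(b * b / a for a, b in zip(p0, p1)) - 1.0)
    return first + second

def expected_x2_equal(n, p1):
    """E(X^2 | H_1) from (30.50), valid when every p0 equals 1/k."""
    k = len(p1)
    return (k - 1) + (n - 1) * (k * sum(b * b for b in p1) - 1.0)

n, k = 50, 4
p0 = [1.0 / k] * k
p1 = [0.10, 0.20, 0.30, 0.40]          # hypothetical alternative

under_h0 = expected_x2(n, p0, p0)      # reduces to k - 1, as at (30.47)
under_h1 = expected_x2(n, p0, p1)      # general formula (30.46)
under_h1_short = expected_x2_equal(n, p1)
```

With these inputs the two formulae agree, and the expectation under the alternative exceeds $k - 1$, in line with (30.48).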


Consistency and unbiassedness of the $X^2$ test
30.25 Equations (30.49) and (30.50) are sufficient to demonstrate the consistency
of the equal-probabilities $X^2$ test. For the test consists in comparing the value of
$X^2$ with a fixed critical value, say $c_\alpha$, in the upper tail of its distribution. Now when
$H_1$ holds, the mean value and variance of $X^2$ are each of order $n$. By Tchebycheff's
inequality (3.95),

\[
P\bigl\{|X^2 - E(X^2)| \ge \lambda[\operatorname{var}(X^2)]^{1/2}\bigr\} \le \frac{1}{\lambda^2}. \tag{30.51}
\]

Since $c_\alpha$ is fixed, it differs from $E(X^2)$ by a quantity of order $n$, so that if we require
the probability that $X^2$ differs from its mean sufficiently to fall below $c_\alpha$, the multiplier
$\lambda$ on the left of (30.51) must be of order $n^{1/2}$, and the right-hand side is therefore of order
$n^{-1}$. Thus

\[
\lim_{n \to \infty} P\{X^2 < c_\alpha\} = 0,
\]

and the test is consistent for any $H_1$ specifying unequal class-probabilities.
The general $X^2$ test, with unequal $p_{0i}$, is also consistent against alternatives
specifying at least one $p_{1i} \ne p_{0i}$, as is intuitively reasonable. A proof is given by Neyman
(1949).


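30.26 Although the $X^2$ test is consistent (and therefore asymptotically unbiassed),
one would not expect it to be unbiassed in general against very close alternatives for
small $n$. However, Mann and Wald (1942) have shown that the equal-probabilities
test is locally unbiassed. Write $P$ for the power of the test and expand $P$ in a Taylor
series with the $k$ values $\theta_i = p_{1i} - (1/k)$ as arguments. We then have

\[
P(\theta_1, \theta_2, \ldots, \theta_k) = P(0, 0, \ldots, 0) + \sum_i \theta_i\frac{\partial P}{\partial\theta_i}
+ \frac{1}{2}\Bigl\{\sum_i \theta_i^2\frac{\partial^2 P}{\partial\theta_i^2} + \sum_{i \ne j}\theta_i\theta_j\frac{\partial^2 P}{\partial\theta_i\,\partial\theta_j}\Bigr\} + \ldots, \tag{30.52}
\]

all derivatives being taken at the $H_0$ point $(0, 0, \ldots, 0)$. For a size-$\alpha$ test,
$P(0, 0, \ldots, 0) = \alpha$.

Further, since $P$ is a symmetric function of the $\theta_i$, all the $\partial P/\partial\theta_i$ are equal at $(0, 0, \ldots, 0)$,
and similarly for the $\partial^2 P/\partial\theta_i^2$ and the $\partial^2 P/\partial\theta_i\,\partial\theta_j$. We may therefore rewrite (30.52) as

\[
P = \alpha + \frac{\partial P}{\partial\theta_1}\sum_i\theta_i + \frac{1}{2}\Bigl\{\frac{\partial^2 P}{\partial\theta_1^2}\sum_i\theta_i^2 + \frac{\partial^2 P}{\partial\theta_1\,\partial\theta_2}\sum_{i \ne j}\theta_i\theta_j\Bigr\} + \ldots \tag{30.53}
\]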

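Now

\[
\sum_i\theta_i = 0, \qquad\text{so that}\qquad 0 = \Bigl(\sum_i\theta_i\Bigr)^2 = \sum_i\theta_i^2 + \sum_{i \ne j}\theta_i\theta_j.
\]

Thus (30.53) becomes simply

\[
P = \alpha + \frac{1}{2}\Bigl(\frac{\partial^2 P}{\partial\theta_1^2} - \frac{\partial^2 P}{\partial\theta_1\,\partial\theta_2}\Bigr)\sum_i\theta_i^2 + \ldots \tag{30.54}
\]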

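We may evaluate the second-order derivatives in (30.54) directly from the exact
expression for the power

\[
P = \sum_{X^2 > c_\alpha}\frac{n!}{n_1!\,n_2!\cdots n_k!}\prod_{i=1}^{k} p_{1i}^{n_i}. \tag{30.55}
\]

At the $H_0$ point we find

\[
\frac{\partial^2 P}{\partial\theta_1^2} - \frac{\partial^2 P}{\partial\theta_1\,\partial\theta_2}
= \sum_{X^2 > c_\alpha}\frac{n!}{n_1!\,n_2!\cdots n_k!}\Bigl(\frac{1}{k}\Bigr)^{n-2}\{n_1(n_1 - 1) - n_1 n_2\}
= k^2\sum\{n_1^2 - n_1 - n_1 n_2\}f_n, \tag{30.56}
\]

where $f_n = \dfrac{n!}{n_1!\,n_2!\cdots n_k!}k^{-n}$ and all unlabelled summations are now over the critical
region $X^2 > c_\alpha$, which, from the form of $X^2$ given at (30.44), is equivalent to

\[
\sum_{i=1}^{k} n_i^2 > b_\alpha.
\]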


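Now $\frac{1}{\alpha}\sum n_1^2 f_n$ is the mean of $n_1^2$ in the critical region, and this must exceed its overall
mean, which by (30.45) is $\frac{n}{k}\bigl(1 - \frac{1}{k} + \frac{n}{k}\bigr)$. Thus

\[
\sum n_1^2 f_n = \alpha\,\frac{n}{k}\Bigl(1 - \frac{1}{k} + \frac{n}{k}\Bigr) + d, \tag{30.57}
\]

where $d > 0$. By exactly the same argument, since $n_1$ is positive, we have

\[
\frac{1}{\alpha}\sum n_1 f_n > \frac{n}{k}
\]

or

\[
\sum n_1 f_n = \alpha(n/k) + e \tag{30.58}
\]

where $e > 0$. Moreover, we obviously have in (30.57) and (30.58)

\[
d > e. \tag{30.59}
\]

From symmetry, we have

\[
\sum n_1 n_2 f_n = \frac{1}{k(k-1)}\sum\Bigl\{\Bigl(\sum_i n_i\Bigr)^2 - \sum_i n_i^2\Bigr\}f_n
= \frac{\alpha n^2}{k(k-1)} - \frac{1}{k-1}\sum n_1^2 f_n. \tag{30.60}
\]

Using (30.57)-(30.60), (30.56) becomes

\[
\frac{1}{k^2}\Bigl(\frac{\partial^2 P}{\partial\theta_1^2} - \frac{\partial^2 P}{\partial\theta_1\,\partial\theta_2}\Bigr)
= \frac{k}{k-1}\Bigl\{\alpha\,\frac{n}{k}\Bigl(1 - \frac{1}{k} + \frac{n}{k}\Bigr) + d\Bigr\} - \Bigl\{\alpha\,\frac{n}{k} + e\Bigr\} - \frac{\alpha n^2}{k(k-1)}
= \frac{k}{k-1}\,d - e > 0. \tag{30.61}
\]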


Thus, in (30.54), the second term on the right is positive. The higher-order terms
neglected in (30.54) involve third and higher powers of the $\theta_i$ and will therefore be
of smaller modulus than the second-order term near $H_0$. Thus, $P > \alpha$ near $H_0$ and
the equal-probabilities test is locally unbiassed, which is a recommendation of this
class-formation procedure, since no such result is known to hold for the $X^2$ test in
general.
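For small $n$ and $k$ the exact power (30.55) can be evaluated by enumerating every multinomial outcome, which gives a direct illustration of local unbiassedness. The sketch below is an assumption-laden illustration: the values of $n$, $k$, the critical value and the alternative are arbitrary choices, and the critical region is taken as $X^2 \ge c$ rather than $X^2 > c$.

```python
from math import factorial

def compositions(n, k):
    """Yield every k-tuple of non-negative integers summing to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def multinomial_prob(counts, probs):
    """Multinomial probability of the class counts under class-probabilities probs."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    p = float(coef)
    for c, q in zip(counts, probs):
        p *= q ** c
    return p

def exact_power(n, k, crit, probs):
    """P{X^2 >= crit}, with X^2 computed from the equal-probabilities form (30.44)."""
    power = 0.0
    for counts in compositions(n, k):
        x2 = k / n * sum(c * c for c in counts) - n
        if x2 >= crit:
            power += multinomial_prob(counts, probs)
    return power

n, k, crit = 12, 3, 6.0                      # arbitrary illustrative values
size = exact_power(n, k, crit, [1 / 3] * 3)  # exact size under H_0
power = exact_power(n, k, crit, [1 / 3 + 0.05, 1 / 3 - 0.05, 1 / 3])
total = exact_power(n, k, -1.0, [1 / 3] * 3) # whole sample space, should be 1
```

Here the power at the nearby alternative exceeds the exact size, as the argument above leads us to expect.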


The limiting power function

30.27 Suppose that, as in our discussion of ARE in Chapter 25, we allow $H_1$ to
approach $H_0$ as $n$ increases, at a rate sufficient to keep the power bounded away from 1.
In fact, let $p_{1i} - p_{0i} = c_i n^{-1/2}$ where the $c_i$ are fixed. Then the distribution of $X^2$ is
asymptotically a non-central $\chi^2$ with degrees of freedom $k - s - 1$ (where $s$ parameters
are estimated by the multinomial ML estimators) and non-central parameter

\[
\lambda = \sum_{i=1}^{k}\frac{c_i^2}{p_{0i}} = n\sum_{i=1}^{k}\frac{(p_{1i} - p_{0i})^2}{p_{0i}}. \tag{30.62}
\]

This result, first announced by Eisenhart (1938), follows at once from the representation
of 30.9-10; its proof is left to the reader as Exercise 30.4. It now follows by using
the Neyman-Pearson lemma (22.6) on (24.18) that the best critical region for testing
$\lambda = 0$ consists of the upper tail of the distribution of $X^2$.
The approximation to the non-central $\chi^2$ distribution in 24.5 enables us to evaluate
the approximate power of the $X^2$ test. In fact, this is given precisely by the integral
(24.30). For $\alpha = 0.05$, the exact tables by Patnaik described in 24.5 may be used.


Example 30.3 


We may illustrate the use of the limiting power function by returning to the problem
of Example 30.2 and examining the effect on the power of the equal-probabilities
procedure of doubling $k$. To facilitate use of the Biometrika Tables, we actually take
four classes with slightly unequal probabilities:


Values          $p_{0i}$    $p_{1i}$    $(p_{1i}-p_{0i})^2$    $(p_{1i}-p_{0i})^2/p_{0i}$
0-0.3           0.259       0.104       0.0240                 0.0927
0.3-0.7         0.244       0.190       0.0029                 0.0119
0.7-1.4         0.250       0.282       0.0010                 0.0040
1.4 and over    0.247       0.424       0.0313                 0.1267
                                                               ------
                                                               0.2353 = $\lambda/n$


In the table, the $p_{0i}$ are obtained from the Gamma distribution with parameter 1, as
before, and the $p_{1i}$ from the Gamma distribution with parameter 1.5. For these
4 classes, and $n = 50$ as in Example 30.2, we evaluate the non-central parameter of
(30.62) as $\lambda = 0.2353 \times 50 = 11.8$. With 3 degrees of freedom for $X^2$, this gives a
power when $\alpha = 0.05$ of 0.83, from Patnaik's table.


Suppose now that we form eight classes by splitting each of the above classes into
two, with the new $p_{0i}$ as equal as is convenient for use of the Tables. We find:


Values          $p_{0i}$    $p_{1i}$    $(p_{1i}-p_{0i})^2$    $(p_{1i}-p_{0i})^2/p_{0i}$
0-0.15          0.139       0.040       0.0098                 0.0705
0.15-0.3        0.120       0.064       0.0031                 0.0258
0.3-0.45        0.103       0.071       0.0010                 0.0097
0.45-0.7        0.141       0.119       0.0005                 0.0035
0.7-1.0         0.129       0.134       0.0000                 0.0002
1.0-1.4         0.121       0.148       0.0007                 0.0058
1.4-2.1         0.125       0.183       0.0034                 0.0272
2.1 and over    0.122       0.241       0.0142                 0.1163
                                                               ------
                                                               0.2590 = $\lambda/n$


For $n = 50$, we now have $\lambda = 13.0$ with 7 degrees of freedom. The approximate
power for $\alpha = 0.05$ is now about 0.75 from Patnaik's table. The doubling of $k$ has
increased $\lambda$, but only slightly. The power is actually reduced, because for given $\lambda$
the central and non-central $\chi^2$ distributions draw closer together as degrees of freedom
increase (cf. Exercise 24.3) and here this effect is stronger than the increase in $\lambda$. However,
$n$ is too small here for us to place any exact reliance on the values of the power
obtained from the limiting power function, and we should perhaps conclude that the
doubling of $k$ has affected the power very little.
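The calculation of Example 30.3 can be retraced in Python using only the standard library. This is a sketch under stated assumptions: the non-central $\chi^2$ tail is computed as a Poisson mixture of central $\chi^2$ tails (a standard representation, used here in place of Patnaik's table), the central distribution function is summed from its power series, and the 5 per cent critical values 7.815 and 14.067 are taken from ordinary $\chi^2$ tables. All function names are illustrative.

```python
from math import exp, log, lgamma

def chi2_cdf(x, df):
    """Central chi-squared distribution function via the incomplete-gamma series."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    m = 1
    while term > 1e-15 * total:
        term *= z / (a + m)
        total += term
        m += 1
    return total * exp(-z + a * log(z) - lgamma(a))

def noncentral_chi2_sf(x, df, lam, terms=200):
    """Upper tail of non-central chi-squared as a Poisson(lam/2) mixture."""
    weight = exp(-lam / 2.0)
    tail = 0.0
    for j in range(terms):
        tail += weight * (1.0 - chi2_cdf(x, df + 2 * j))
        weight *= (lam / 2.0) / (j + 1)
    return tail

def noncentral_parameter(n, p0, p1):
    """lambda of (30.62)."""
    return n * sum((b - a) ** 2 / a for a, b in zip(p0, p1))

# Class probabilities copied from the two tables of Example 30.3.
p0_4 = [0.259, 0.244, 0.250, 0.247]
p1_4 = [0.104, 0.190, 0.282, 0.424]
p0_8 = [0.139, 0.120, 0.103, 0.141, 0.129, 0.121, 0.125, 0.122]
p1_8 = [0.040, 0.064, 0.071, 0.119, 0.134, 0.148, 0.183, 0.241]

lam_4 = noncentral_parameter(50, p0_4, p1_4)
lam_8 = noncentral_parameter(50, p0_8, p1_8)
power_4 = noncentral_chi2_sf(7.815, 3, lam_4)    # 3 degrees of freedom
power_8 = noncentral_chi2_sf(14.067, 7, lam_8)   # 7 degrees of freedom
```

The computed values should be close to the tabulated figures: $\lambda$ rises a little on doubling $k$, while the limiting power falls.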


The choice of k with equal probabilities 

30.28 With the aid of the asymptotic power function of 30.27, we can get a heuristic
indication of how to choose $k$ in the equal-probabilities case. The non-central
parameter (30.62) is then

\[
\lambda = nk\sum_{i=1}^{k}\Bigl(p_{1i} - \frac{1}{k}\Bigr)^2 = nk\sum_{i=1}^{k}\theta_i^2. \tag{30.63}
\]

We now assume that $|\theta_i| \le 1/k$, all $i$, and consider what happens as $k$ becomes large.
$\theta = \sum_{i=1}^{k}\theta_i^2$, as a function of $k$, will then be of the same order of magnitude as a sum of
squares in the interval $(-1/k, 1/k)$, i.e.

\[
\theta \sim a\int_{-1/k}^{1/k} u^2\,du = 2a\int_{0}^{1/k} u^2\,du. \tag{30.64}
\]

The asymptotic power of the test is a function $P\{k, \lambda\}$ which therefore is $P\{k, \lambda(k)\}$;
it is a monotone increasing function of $\lambda$, and has its stationary values when $d\lambda(k)/dk = 0$.
We thus put, using (30.63) and (30.64),

\[
0 = \frac{1}{n}\frac{d\lambda}{dk} = \theta + k \cdot 2a\Bigl(\frac{1}{k}\Bigr)^2\Bigl(-\frac{1}{k^2}\Bigr) = \theta - 2ak^{-3},
\]

giving

\[
k^{-3} = \theta/(2a). \tag{30.65}
\]


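We cannot let $k \to \infty$ without restriction since all $\theta_i$ then $\to 0$, but we assume $k$ large
enough so that both the $H_0$ and $H_1$ distributions of $X^2$ are near normality, and the
approximate power function of the test is (cf. (25.53)) therefore

\[
P = G\left\{\frac{\Bigl[\dfrac{\partial}{\partial\theta}E(X^2 \mid H_1)\Bigr]_{\theta=0}\,\theta}{[\operatorname{var}(X^2 \mid H_0)]^{1/2}} - \lambda_\alpha\right\}, \tag{30.66}
\]

where

\[
G(-\lambda_\alpha) = \alpha \tag{30.67}
\]

determines the size of the test. From (30.49) and (30.50)

\[
\Bigl[\frac{\partial}{\partial\theta}E(X^2 \mid H_1)\Bigr]_{\theta=0} = (n-1)k, \tag{30.68}
\]

\[
\operatorname{var}(X^2 \mid H_0) = 2(k-1), \tag{30.69}
\]

and if we insert these values and also (30.65) into (30.66), we obtain approximately

\[
P = G\{2^{1/2}a(n-1)k^{-5/2} - \lambda_\alpha\}. \tag{30.70}
\]

This is the approximate power function at the point where power is maximized for
choice of $k$. If we choose a value $P_0$ at which we wish the maximization to occur, we
have, on inverting (30.70),

\[
G^{-1}\{P_0\} = 2^{1/2}a(n-1)k^{-5/2} - \lambda_\alpha, \tag{30.71}
\]

or

\[
k = b\left\{\frac{2^{1/2}(n-1)}{\lambda_\alpha + G^{-1}(P_0)}\right\}^{2/5}, \tag{30.72}
\]

where $b = a^{2/5}$.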


30.29 In the special case $P_0 = \tfrac{1}{2}$ (where we wish to choose $k$ to maximize power
when it is 0.5), $G^{-1}(0.5) = 0$ and (30.72) simplifies. In this case, Mann and Wald
(1942) obtained (30.72) by a much more sophisticated and rigorous argument; they
found $b = 4$ in the case of the simple hypothesis. Our own derivation suggests that
the same essential argument applies for the composite hypothesis, but $b$ may be different
in this case.

We conclude that $k$ should be increased in the equal-probabilities case in proportion
to $n^{2/5}$, and that $k$ should be smaller if we are interested in the region of high power
(when $G^{-1}\{P_0\}$ is large) than if we are interested in the "neighbouring" region of low
power (when $G^{-1}\{P_0\}$ approaches $-\lambda_\alpha$, from above since the test is locally unbiassed).

With $b = 4$ and $P_0 = \tfrac{1}{2}$, (30.72) leads to much larger values of $k$ than are commonly
used. $k$ will be doubled when $n$ increases by a factor of $4\sqrt{2}$. When $n = 200$,
$k = 31$ for $\alpha = 0.05$ and $k = 27$ for $\alpha = 0.01$; these are about the lowest values of $k$
for which the approximate normality assumed in our argument (and also in Mann and
Wald's) is at all accurate. In this case, Mann and Wald recommend the use of (30.72)
when $n > 450$ for $\alpha = 0.05$ and $n > 300$ for $\alpha = 0.01$. It will be seen that $n/k$, the
hypothetical expectation in each class, increases as $n^{3/5}$, and is equal to about 6 and 8
respectively when $n = 200$, $\alpha = 0.05$ and 0.01.
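Formula (30.72) is easily evaluated numerically. The sketch below uses Python's standard-library normal distribution for $G^{-1}$; the function name and its default arguments are illustrative choices, with $b = 4$ taken from the Mann-Wald result for the simple hypothesis.

```python
from statistics import NormalDist

def mann_wald_k(n, alpha, p0=0.5, b=4.0):
    """Number of equal-probability classes from (30.72).

    lambda_alpha is defined by G(-lambda_alpha) = alpha, as at (30.67),
    and p0 is the power at which the choice of k is to be optimal.
    """
    g_inv = NormalDist().inv_cdf
    lam_alpha = g_inv(1.0 - alpha)
    denominator = lam_alpha + g_inv(p0)
    if denominator <= 0:
        raise ValueError("p0 must exceed alpha for (30.72) to apply")
    return b * (2 ** 0.5 * (n - 1) / denominator) ** 0.4

k_200_05 = mann_wald_k(200, 0.05)          # about 31
k_200_01 = mann_wald_k(200, 0.01)          # about 27
k_50_high = mann_wald_k(50, 0.05, p0=0.8)  # about 15, the case of Example 30.4
```

The function returns a real number; in practice it would be rounded to a convenient integer.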

C. A. Williams (1950) reports that $k$ can be halved from the Mann-Wald optimum
without serious loss of power at the 0.50 point. But it should be remembered that
$n$ and $k$ must be substantial before (30.72) produces good results. Example 30.4
illustrates the point, which is also borne out by calculations made by Hamdan (1963)
for tests of a normal mean.


Example 30.4 
Consider again the problem of Example 30.3. We there found that we were at
around the 0.8 value for power. From a table of the normal distribution $G^{-1}(0.8) =
0.84$. With $b = 4$, $\alpha = 0.05$, $\lambda_\alpha = 1.64$, (30.72) gives for the optimum $k$ around this
point

\[
k = 4\left\{\frac{2^{1/2}(n-1)}{2.48}\right\}^{2/5} = 3.2\,(n-1)^{2/5}.
\]

For $n = 50$, this gives $k = 15$ approximately.
Suppose now that we use the Biometrika Tables to construct a 15-class grouping
with probabilities $p_{0i}$ as nearly equal as is convenient. We find


Values          $p_{0i}$    $p_{1i}$    $(p_{1i}-p_{0i})^2/p_{0i}$
0-0.05          0.049       0.008       0.034
0.05-0.15       0.090       0.032       0.037
0.15-0.20       0.042       0.020       0.012
0.20-0.30       0.078       0.044       0.015
0.30-0.40       0.071       0.047       0.008
0.40-0.50       0.063       0.048       0.004
0.50-0.65       0.085       0.072       0.002
0.65-0.75       0.050       0.047       0.000
0.75-0.90       0.065       0.067       0.000
0.90-1.1        0.074       0.083       0.000
1.1-1.3         0.060       0.075       0.004
1.3-1.6         0.071       0.095       0.008
1.6-2.0         0.067       0.101       0.018
2.0-2.7         0.068       0.116       0.034
2.7 and over    0.067       0.145       0.098
                                        -----
                                        0.274 = $\lambda/n$


Here $\lambda = 13.7$ and Patnaik's table gives a power of 0.64 for 14 degrees of freedom.
$\lambda$ has again been increased, but power reduced because of the increase in $k$. We are
not at the optimum here. With large $k$ (and hence large $n$), the effect of increasing
degrees of freedom would not offset the increase of $\lambda$ in this way.


30.30 We must not make $k$ too large, since the multinormal approximation to the
multinomial distribution cannot be expected to be satisfactory if the $np_{0i}$ are very small.
A rough rule which is commonly used is that no expected frequency ($np_{0i}$) should be
less than 5. There seems to be no general theoretical basis for this rule, and two
points are worth making concerning it. If the $H_0$ distribution is unimodal, and
equal-width classes are used in the conventional manner, small expected frequencies will
occur only at the tails. Cochran (1952, 1954) recommends that a flexible approach
be adopted, and has verified that one or two expected frequencies may be allowed to
fall to 1 or even lower, if $X^2$ has at least 6 degrees of freedom, without disturbing the
test with $\alpha = 0.05$ or 0.01.

In the equal-probabilities case, all the expected frequencies will be equal. Slakter
(1966) shows that even for fractional equal expected frequencies, the approximation
remains good. The Mann-Wald procedure of 30.29 leads to expected frequencies
always greater than 5 for $n > 200$. It is interesting to note that in Examples 30.3-4,
the application of this limit would have ruled out the 15-class procedure, and that the
more powerful 8-class procedure, with expected frequencies ranging from 5 to 7, would
have been acceptable.

Hoeffding (1965) shows that for testing a simple $H_0$ (the composite case is less clear),
if $k$ is held fixed while $\alpha \to 0$ suitably as $n \to \infty$, the LR test is more powerful than the
$X^2$ test. This result does not hold if $k$ increases with $n$, e.g. in the equal-probabilities
case as at (30.72).

Finally, we remark that the large-sample nature of the distribution theory of $X^2$
is not a disadvantage in practice, for we do not usually wish to test goodness-of-fit
except in large samples.


Recommendations for the $X^2$ test
30.31 We summarize the above discussion with a few practical recommendations: 

(1) If the distribution being tested has been tabulated, use classes with equal, 
or nearly equal, probabilities. 

(2) Determine the number of classes when $n$ exceeds 200 approximately by
(30.72) with $b$ between 2 and 4.

(3) If parameters are to be estimated, use the ordinary ML estimators in the 
interests of efficiency, but recall that there is partial recovery of degrees of 
freedom (30.19) so that critical values should be adjusted upwards ; if the 
multinomial ML estimators are used, no such adjustment is necessary. 

None of the theory above will hold if the form (instead of the parameters alone) is 
estimated from the data used to test goodness of fit. 


30.32 Apart from the difficulties we have already discussed in connexion with $X^2$
tests, which are not very serious, they have been criticized on two counts. In each
case, the criticism is of the power of the test. Firstly, the fact that the essential underlying
device is the reduction of the problem to a multinomial distribution problem
itself implies the necessity for grouping the observations into classes. In a broad
general sense, we must lose information by grouping in this way, and we suspect that
the loss will be greatest when we are testing the fit of a continuous distribution.
Secondly, the fact that the $X^2$ statistic is based on the squares of the deviations of observed
from hypothetical frequencies implies that the $X^2$ test will be insensitive to
the patterns of signs of these deviations, which is clearly informative. The first of
these criticisms is the more radical, since it must clearly lead to the search for other
test statistics to replace $X^2$, and we postpone discussion of such tests until after we have
discussed the second criticism.


The signs of deviations 

30.33 Let us consider how we should expect the pattern of deviations (of observed
from hypothetical frequencies) to behave in some simple cases. Suppose that a simple
hypothesis specifies a continuous unimodal distribution with location and scale parameters,
say equal to mean and standard deviation; and suppose that the hypothetical
mean is too high. For any set of $k$ classes, the $p_{0i}$ will be too small for low values of
the variate, and too high thereafter, as illustrated in Fig. 30.1. Since in large samples
the observed proportions will converge stochastically to the true probabilities, the
pattern of signs of observed deviations will be a series of positive deviations followed by a
series of negative deviations. If the hypothetical mean is too low this pattern is reversed.

Fig. 30.1. Hypothetical and true distributions differing in location
(frequency plotted against variate-value; curves labelled "True distribution" and "$H_0$ distribution")

Suppose now that the hypothetical value of the scale parameter is too low. The
picture will now be as in Fig. 30.2. The pattern of deviations in large samples is now
seen to be a series of positives, followed by a series of negatives, followed by positives
again. If the hypothetical scale parameter is too high, all these signs are reversed.

Fig. 30.2. Hypothetical and true distributions differing in scale
(frequency plotted against variate-value; curves labelled "True distribution" and "$H_0$ distribution")
Now of course we do not knowingly use the $X^2$ test for changes in location and
scale alone, since we can then find more powerful test statistics. However, when
there is error in both location and scale parameters, Fig. 30.3 shows that the situation
is essentially unchanged; we shall still have three (or in more complicated cases,
somewhat more) "runs" of signs of deviations. More generally, whenever the parameters
have true values differing from their hypothetical values, or when the true
distributional form is one differing "smoothly" from the hypothetical form, we expect
the signs of deviations to cluster in this way instead of being distributed randomly,
as they should be if the hypothetical frequencies were the true ones.

Fig. 30.3. Hypothetical and true distributions differing in location and scale
(frequency plotted against variate-value; curves labelled "True distribution" and "$H_0$ distribution")


30.34 This observation suggests that we supplement the $X^2$ test with a test of
the number of runs of signs among the deviations, small numbers forming the critical
region. The elementary theory of runs necessary for this purpose is given as Exercise
30.8. Before we can use it in any precise way, however, we must investigate the
relationship between the "runs" test and the $X^2$ test. F. N. David (1947), Seal (1948)
and Fraser (1950) showed that when $H_0$ holds the tests are asymptotically independent
(cf. Exercise 30.7) and that for testing the simple hypothesis all patterns of signs are
equiprobable, so that the distribution theory of Exercise 30.8 can be combined with
the $X^2$ test as indicated in Exercise 30.9.
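Counting the runs of signs is simple to mechanize. The following Python sketch is illustrative only: the function name is hypothetical, and the treatment of a zero deviation (it is dropped) is an assumption, since the text does not discuss that case. The two sets of frequencies are those tabulated in Example 30.2.

```python
def count_sign_runs(observed, expected):
    """Number of runs of like signs among the deviations observed - expected.

    Classes with a zero deviation are ignored (an assumption made here).
    """
    signs = [1 if o > e else -1 for o, e in zip(observed, expected) if o != e]
    runs = 0
    previous = 0
    for s in signs:
        if s != previous:
            runs += 1
            previous = s
    return runs

# Equal-probabilities grouping of Example 30.2: signs are -, -, +, +.
runs_equal = count_sign_runs([9, 9, 17, 15], [12.5, 12.5, 12.5, 12.5])

# Equal-width grouping of Example 30.2: signs are -, +, +, +.
runs_width = count_sign_runs([14, 13, 10, 13], [19.7, 11.9, 7.2, 11.2])
```

With only four classes the number of runs carries little information; the count becomes useful when $k$ is reasonably large.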

The supplementation by the "runs" test is likely to be valuable in increasing sensitivity
when testing a simple hypothesis, as in the illustrative discussion above. For
the composite hypothesis of particular interest where tests of fit are concerned, when
all parameters are to be estimated from the sample, it is of no practical value, since the
patterns of signs of deviations, although independent of $X^2$, are not equiprobable as
in the simple hypothesis case, and the distribution theory of Exercise 30.8 is therefore
of no use (cf. Fraser, 1950).


Other tests of fit 
30.35 We now turn to the discussion of alternative tests of fit. Since these have
striven to avoid the loss of information due to grouping suffered by the $X^2$ test, they
cannot avail themselves of multinomial simplicities, and we must expect their theory
to be more difficult. Before we discuss the more important tests individually, we
remark on a feature they have in common.

It will have been noticed that, when using $X^2$ to test a simple hypothesis, its distribution
is asymptotically $\chi^2_{k-1}$ whatever the simple hypothesis may be, although its exact
distribution does depend on the hypothetical distribution specified. It is clear that
this result is achieved because of the intervention of the multinomial distribution and
its tendency to joint normality. Moreover, the same is true of the composite hypothesis
situation if multinomial ML estimators are used: in this case $X^2 \to \chi^2_{k-s-1}$ whatever
the composite hypothesis may be, though its exact distribution is even more clearly seen
to be dependent on the composite hypothesis concerned. When other estimators are
used (even when fully efficient ordinary ML estimators are used) these pleasant asymptotic
properties do not hold: even the asymptotic distribution of $X^2$ now depends on
the latent roots of the matrix (30.37), which are in general functions both of the hypothetical
distribution and of the values of the parameters $\theta$.

We express these results by saying that, in the first two instances above, the distribution
of $X^2$ is asymptotically distribution-free (i.e. free of the influence of the hypothetical
distribution's form and parameters), whereas in the third instance it is not asymptotically
distribution-free or even parameter-free (i.e. free of the influence of the parameters
of $F_0$, without being distribution-free).


30.36 We shall see that the most important alternative tests of fit all make use,
directly or indirectly, of the probability-integral transformation, which we have encountered
on various occasions (e.g. 1.27, 24.11) as a means of transforming any known
continuous distribution to the rectangular distribution on the interval (0,1). In our
present notation, if we have a simple hypothesis of fit specifying a d.f. $F_0(x)$, to which
a f.f. $f_0(x)$ corresponds, then the variable $y = \int_{-\infty}^{x} f_0(u)\,du = F_0(x)$ is rectangularly
distributed on (0,1). Thus if we have a set of $n$ observations $x_i$ and transform them
to a new set $y_i$ by the probability-integral transformation for a known $F_0(x)$, and use a
function of the $y_i$ to test the departure of the $y_i$ from rectangularity, the distribution of
the test statistic will be distribution-free, not merely asymptotically but for any $n$.
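The probability-integral transformation is easily illustrated with the data of Example 30.2, for which $F_0(x) = 1 - \exp(-x)$. The Python sketch below is a minimal illustration; the function names are hypothetical.

```python
import math

# The 50 ordered observations of Example 30.2.
DATA = [0.01, 0.01, 0.04, 0.17, 0.18, 0.22, 0.22, 0.25, 0.25, 0.29,
        0.42, 0.46, 0.47, 0.47, 0.56, 0.59, 0.67, 0.68, 0.70, 0.72,
        0.76, 0.78, 0.83, 0.85, 0.87, 0.93, 1.00, 1.01, 1.01, 1.02,
        1.03, 1.05, 1.32, 1.34, 1.37, 1.47, 1.50, 1.52, 1.54, 1.59,
        1.71, 1.90, 2.10, 2.35, 2.46, 2.46, 2.50, 3.73, 4.07, 6.03]

def exponential_cdf(x):
    """Hypothetical d.f. F_0(x) = 1 - exp(-x) for x > 0."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0

def probability_integral_transform(xs, cdf):
    """Map each observation x to y = F_0(x), which lies in (0, 1)."""
    return [cdf(x) for x in xs]

ys = probability_integral_transform(DATA, exponential_cdf)

# Under H_0 the y's are rectangular on (0, 1), so about a quarter of them
# should fall in each quarter of the interval.
quarter_counts = [sum(1 for y in ys if lo < y <= lo + 0.25)
                  for lo in (0.0, 0.25, 0.5, 0.75)]
```

The quarter counts reproduce the observed frequencies of the equal-probabilities table in Example 30.2, since equal-probability classes for $x$ are equal-width classes for $y$.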

When the hypothetical distribution is composite, say $F_0(x \mid \theta_1, \theta_2, \ldots, \theta_s)$ with the
$s$ parameters $\theta$ to be estimated, we must select $s$ functions $t_1, \ldots, t_s$ of the $x_i$ for this
purpose. The transformed variables are now

\[
y_i = \int_{-\infty}^{x_i} f_0(u \mid t_1, t_2, \ldots, t_s)\,du,
\]

but they are neither independent nor rectangularly distributed, and their distribution
will depend in general both on the hypothetical distribution $F_0$ and on the true values of
its parameters, as F. N. David and Johnson (1948) showed in detail. However (cf. Exercise
30.10), if $F_0$ has only parameters of location and scale, suitably invariantly estimated,
the distribution of the $y_i$ will depend on the form of $F_0$ but not on its parameters. It
follows that for finite $n$, no test statistic based on the $y_i$ can be distribution-free for a
composite hypothesis of fit (although it may be parameter-free if only location and
scale parameters are involved). Of course, such a test statistic may still be asymptotically
distribution-free.


The Neyman-Barton “smooth” tests 

30.37 The first of the tests of fit, alternative to $X^2$, which we shall discuss are the
so-called "smooth" tests first developed by Neyman (1937a), who treated only the
simple hypothesis, as we do now. Given $H_0: F(x) = F_0(x)$, we transform the $n$ observations
$x_i$ as in 30.36 by the probability integral transformation

\[
y_i = \int_{-\infty}^{x_i} f_0(u)\,du = F_0(x_i), \qquad i = 1, 2, \ldots, n, \tag{30.73}
\]

and obtain $n$ independent observations rectangularly distributed on the interval (0, 1)
when $H_0$ holds. We specify alternatives to $H_0$ as departures from rectangularity of
the $y_i$, which nevertheless remain independent on (0,1). Neyman set up a system of
distributions designed to allow the alternatives to vary smoothly from the $H_0$ (rectangular)
distribution in terms of a few parameters. (It is this "smoothness" of the
alternatives which has been transferred, by hypallage, to become a description of the
tests.) In fact, Neyman specified for the frequency function of any $y_i$ the alternatives

\[
f(y \mid H_k) = c(\theta_1, \theta_2, \ldots, \theta_k)\exp\Bigl\{1 + \sum_{r=1}^{k}\theta_r\pi_r(y)\Bigr\}, \qquad 0 \le y \le 1, \quad k = 1, 2, 3, \ldots, \tag{30.74}
\]

where $c$ is a constant which ensures that (30.74) integrates to 1 and the $\pi_r(y)$ are
Legendre polynomials transformed linearly so that they are orthonormal on the interval
(0,1).(*) If we write $z = y - \tfrac{1}{2}$, the polynomials are, to the fourth order,

\[
\begin{aligned}
\pi_0(z) &= 1,\\
\pi_1(z) &= 3^{1/2}\cdot 2z,\\
\pi_2(z) &= 5^{1/2}\cdot(6z^2 - \tfrac{1}{2}),\\
\pi_3(z) &= 7^{1/2}\cdot(20z^3 - 3z),\\
\pi_4(z) &= 3\cdot(70z^4 - 15z^2 + \tfrac{3}{8}).
\end{aligned} \tag{30.75}
\]


30.38 The problem now is to find a test statistic for Hy against H;,. We can see 


(*) The Legendre polynomials, say L,(z), are usually defined by 
d*™ 
Lele) = (7! 275 (@?-1)'} 
and satisfy the orthogonality conditions 
: 0, r#S, 
| L,(2) L;(2) dz = 2 
a 2r+1’ 
To render them orthonormal on (—4, 3), therefore, we define polynomials 2,(z) by 
a2) = (2r+1)? L,(22) 


We could now transfer to the interval (0, 1) by writing y = z+3. It is more convenient, as 
in the text, to work in terms of z = y—#3. 
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that if we rewrite (30.74) as 


k 
S(y| Ax) = c(@)exp{ 6,20 (a) a 4 1 ee GL, (30.76) 
r=0 
defining 6, = 1, this includes Hy also. We wish to test the simple 
HW, 9, =), 2] =e, = 9, (30.77) 


or equivalently 


k 
H,: 3 @ = 0, (30.78) 
f=] 


against its composite negation. It will be seen that (30.76) is an alternative of the
exponential family, linear in the $\theta_r$ and $\pi_r$. The Likelihood Function for $n$ independent
observations is

$$L(y \mid \boldsymbol\theta) = \{c(\boldsymbol\theta)\}^n \exp\left\{\sum_{r=0}^{k} \theta_r \sum_{i=1}^{n} \pi_r(y_i)\right\}. \tag{30.79}$$


(30.79) clearly factorizes into $k$ parts, and each statistic $t_r = \sum_{i=1}^{n} \pi_r(y_i)$ is sufficient for
$\theta_r$, and we therefore may confine ourselves to functions of the $t_r$ in our search for a
test statistic. When dealing with linear functions of the $\theta_r$ in 23.27-32, we saw that the
equivalent function of the $t_r$ gives a UMPU test. Here we are interested in the sum
of squares of the parameters, and it seems reasonable to use the corresponding function
of the $t_r$, i.e. $\sum_{r=1}^{k} t_r^2$, as our test statistic, although we cannot expect it to have this strong
optimum property. This was, in fact, apart from a constant, the statistic proposed by
Neyman (1937a), who used a large-sample argument to justify its choice. E. S. Pearson
(1938) showed that in large samples the statistic is equivalent to the LR test of (30.78).
We write $u_r = n^{-1/2} t_r$; the test statistic is then(*)

$$p_k^2 = \sum_{r=1}^{k} u_r^2 = \frac{1}{n} \sum_{r=1}^{k} \left\{\sum_{i=1}^{n} \pi_r(y_i)\right\}^2. \tag{30.80}$$

30.39 Since $u_r = n^{-1/2} \sum_{i=1}^{n} \pi_r(y_i)$, the $u_r$ are asymptotically normally distributed by
the Central Limit theorem, with mean and variance obtained from (30.79) as
$$E(u_r) = n^{1/2} E\{\pi_r(y)\} = n^{1/2} \theta_r, \tag{30.81}$$
$$\operatorname{var}(u_r) = \operatorname{var}\{\pi_r(y)\} = 1, \tag{30.82}$$


and they are uncorrelated since the $\pi_r$ are orthogonal. Thus the test statistic (30.80)
is asymptotically a sum of squares of $k$ independent normal variables with unit variances
and means all zero on $H_0$, but not otherwise. $p_k^2$ is therefore distributed asymptotically
in the non-central $\chi^2$ form with $k$ degrees of freedom and non-central parameter,
from (30.81),
$$\lambda = n \sum_{r=1}^{k} \theta_r^2. \tag{30.83}$$

It follows at once that $p_k^2$ is a consistent (and, by 24.17, asymptotically unbiassed) test, as


(*) The statistic is usually written $\psi_k^2$; we abandon this notation in accordance with our
convention regarding Roman letters for statistics and Greek for parameters.
Neyman (1937a) showed. F. N. David (1939) found that, when $H_0$ holds, the simplest
test statistics $p_1^2$ and $p_2^2$ are adequately approximated by the (central) $\chi^2$ distributions
with 1 and 2 degrees of freedom respectively for $n \ge 20$.
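Once the observations have been transformed by (30.73), the statistic (30.80) is a short computation. The following Python sketch assumes the transformed values $y_i$ are already available and restricts $k$ to at most 4, so that the polynomials listed in (30.75) suffice; the function names and the sample values are illustrative only.

```python
import math

def pi_r(r, y):
    """Orthonormal Legendre polynomial of order r (1..4) on (0, 1), from (30.75)."""
    z = y - 0.5
    if r == 1:
        return math.sqrt(3.0) * 2.0 * z
    if r == 2:
        return math.sqrt(5.0) * (6.0 * z ** 2 - 0.5)
    if r == 3:
        return math.sqrt(7.0) * (20.0 * z ** 3 - 3.0 * z)
    if r == 4:
        return 3.0 * (70.0 * z ** 4 - 15.0 * z ** 2 + 0.375)
    raise ValueError("only orders 1 to 4 are supported in this sketch")

def neyman_p2(y, k):
    """p_k^2 of (30.80): (1/n) * sum over r of (sum over i of pi_r(y_i))^2."""
    n = len(y)
    return sum(sum(pi_r(r, yi) for yi in y) ** 2 for r in range(1, k + 1)) / n

# Hypothetical transformed observations, crowded towards the upper end of (0, 1).
y = [0.55, 0.62, 0.71, 0.48, 0.90, 0.83, 0.67, 0.95, 0.58, 0.76]
p1_squared = neyman_p2(y, 1)
p2_squared = neyman_p2(y, 2)
```

On $H_0$ these values would be referred to the central $\chi^2$ distributions with 1 and 2 degrees of freedom; the sample of 10 used here is for illustration only and is smaller than the sample size for which that approximation is reported to be adequate.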


The formulation of alternative hypotheses 

30.40 The choice of $k$, the order of the system of alternatives (and the number
of parameters by which the departure from $H_0$ is expressed) has to be made before a
test is obtained. Clearly, we want no more parameters than are necessary for the
alternative of interest, since they will "dilute" the test. Unfortunately, one frequently
has no very precise alternative in mind when testing fit. This is a very real difficulty,
and may be compared with the choice of number of classes in the $X^2$ test. In the latter
case, we found that the choice could be based on sample size and test size alone; in
our present uncertainty, there is no very clear guidance yet available.


30.41 In the first of a series of papers, Barton (1953-6), on whose work the
following sections are based, has considered a slightly different general system of
alternatives. He defines, instead of (30.76),

$$f(y \mid H_k) = \sum_{r=0}^{k} \theta_r \pi_r(y), \qquad 0 \le y \le 1, \quad k = 0, 1, 2, \ldots, \tag{30.84}$$

with $\theta_0 = 1$ as before. No constant $c(\boldsymbol\theta)$ is now required, since

$$\int_0^1 \left\{\sum_{r=0}^{k} \theta_r \pi_r(y)\right\} dy = 1 + \sum_{r=1}^{k} \theta_r \int_0^1 \pi_0(y)\,\pi_r(y)\,dy = 1, \tag{30.85}$$

since $\pi_0(y) = 1$ and the $\pi_r$ are orthogonal. However, we now must ensure that
(30.84) is non-negative over the interval (0, 1), and this involves restriction of possible
values of the $\theta_r$. Thus, for example, with $k = 1$ the value of $\pi_1$ given in (30.75)
indicates that we must restrict $\theta_1$ by $|\theta_1| \le 3^{-1/2}$.

Now if we write $\theta_r = n^{-1/2} \lambda_r$, $r \ge 1$, we see that we have a set of alternatives
approaching $H_0$ as $n \to \infty$. What is more, as $n \to \infty$ we have

$$1 + n^{-1/2} \sum_r \lambda_r \pi_r(y) \sim \exp\left\{n^{-1/2} \sum_r \lambda_r \pi_r(y)\right\},$$

so that the asymptotic distribution of $p_k^2$ for the alternatives (30.76) will apply under
(30.84) with $\theta_r = n^{-1/2} \lambda_r$. In order to obtain the asymptotic non-central $\chi^2$
distribution of $p_k^2$, in which the non-central parameter is now $\lambda = \sum_{r=1}^{k} \lambda_r^2$, we have had to let
$H_k$ tend to $H_0$ as $n \to \infty$. This is exactly what we did to obtain the corresponding
result for the $X^2$ test in 30.27.


30.42 If we do have a particular alternative distribution $g(y)$ in mind, we can
express it in terms of a member of the class (30.84) as follows. Let us choose
parameters $\theta_r$ in (30.84) to minimize the integral

$$Q^2 = \int_0^1 \{g(y) - f(y \mid H_k)\}^2\,dy = \int_0^1 \left[g(y) - \left\{1 + \sum_{r=1}^{k} \theta_r \pi_r(y)\right\}\right]^2 dy. \tag{30.86}$$
Differentiating with respect to the $\theta_r$, we find the necessary conditions for a minimum

$$\int_0^1 \pi_r(y) \left[g(y) - \left\{1 + \sum_{s=1}^{k} \theta_s \pi_s(y)\right\}\right] dy = 0, \qquad \text{all } r,$$

and using the orthogonality of the $\pi_r(y)$, this becomes
$$E\{\pi_r(y) \mid H_1\} = \theta_r, \qquad \text{all } r. \tag{30.87}$$


The minimum value of (30.86) is, as in ordinary Least Squares theory,
$$Q^2_{\min} = \int_0^1 \{g(y) - 1\}^2\,dy - \int_0^1 \left\{\sum_{r=1}^{k} \theta_r \pi_r(y)\right\}^2 dy = \int_0^1 g^2(y)\,dy - 1 - \sum_{r=1}^{k} \theta_r^2, \tag{30.88}$$

using the orthogonality again.
$Q^2_{\min}$ is non-negative by definition. Regarded as a function of $k$, it is seen to be
non-increasing, and, since only $\sum_{r=1}^{k} \theta_r^2$ depends on $k$, $Q^2_{\min} \to 0$ as $k \to \infty$. In fitting
the system (30.84) to $g(y)$, therefore, we essentially have to judge the approximation
of $\lambda = \sum_{r=1}^{k} \theta_r^2$ to $\int_0^1 g^2(y)\,dy - 1$, which bounds it above. The integral is in terms of
the probability-integral-transformed variable; it is often more convenient to evaluate
it in terms of the alternative distribution of $x$, untransformed. Call this $h(x)$. We
then have, since $g(y)\,dy = h(x)\,dx$,

$$\int_0^1 g^2(y)\,dy = \int_{-\infty}^{\infty} h^2(x)\,\frac{dx}{dy}\,dx = \int_{-\infty}^{\infty} \frac{h^2(x)}{f(x \mid H_0)}\,dx. \tag{30.89}$$


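Example 30.5
Consider the normal distribution
$$h(x \mid \mu) = (2\pi)^{-1/2} \exp\{-\tfrac12 (x - \mu)^2\}, \qquad -\infty < x < \infty,$$
with $H_0 : \mu = 0$. Using (30.89),

$$\begin{aligned}
\int_0^1 g^2(y)\,dy &= \int_{-\infty}^{\infty} \frac{h^2(x \mid \mu)}{h(x \mid 0)}\,dx \\
&= (2\pi)^{-1/2} \int_{-\infty}^{\infty} \exp\{-(x - \mu)^2 + \tfrac12 x^2\}\,dx \\
&= \exp(\mu^2) \cdot (2\pi)^{-1/2} \int_{-\infty}^{\infty} \exp\{-\tfrac12 (x - 2\mu)^2\}\,dx \\
&= \exp(\mu^2).
\end{aligned}$$

Thus we must compare $\lambda = \sum_{r=1}^{k} \theta_r^2$ with

$$\exp(\mu^2) - 1 = \sum_{r=1}^{\infty} \frac{\mu^{2r}}{r!}. \tag{30.90}$$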


From (30.87),
$$\theta_r = \int_0^1 \pi_r(y)\,g(y)\,dy.$$
Because the $\pi_r(y - \tfrac12)$ are even functions for even $r$ (cf. (30.75)) we have, since $g(y)$ is
also even about the value $\tfrac12$ for this symmetrical alternative,

$$\theta_{2r} = 0.$$
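For odd $r$, we must evaluate individual terms. We find, using (30.75),

$$\begin{aligned}
\theta_1 &= \int_{-1/2}^{1/2} 3^{1/2} \cdot 2z\,g(z)\,dz = 3^{1/2} \cdot 2 \int_{-\infty}^{\infty} z\,h(x)\,dx \\
&= 3^{1/2} \cdot 2 \int_{-\infty}^{\infty} \left\{\int_{-\infty}^{x} h(u \mid H_0)\,du - \tfrac12\right\} h(x \mid H_0) \exp\{-\tfrac12 (\mu^2 - 2\mu x)\}\,dx
\end{aligned}$$

and
$$\left[\frac{d\theta_1}{d\mu}\right]_{\mu=0} = 3^{1/2} \cdot 2 \int_{-\infty}^{\infty} x \left\{\int_{-\infty}^{x} h(u \mid H_0)\,du - \tfrac12\right\} h(x \mid H_0)\,dx = 3^{1/2} \cdot \tfrac12 \Delta_1,$$

where $\Delta_1$ is Gini's mean difference (cf. Exercise 2.9), equal to $2/\pi^{1/2}$ in this normal case
(cf. 10.14), so that
$$\left[\frac{d\theta_1}{d\mu}\right]_{\mu=0} = \left(\frac{3}{\pi}\right)^{1/2}.$$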


Thus for small variations $d\mu$ in $\mu$, $\theta_1$ alone will vary by $(3/\pi)^{1/2}\,d\mu = 0.98\,d\mu$, and if
we use the $p_1^2$ test with $\theta_1 = (3/\pi)^{1/2} \mu$ we see from (30.90) that $\theta_1^2 = (3/\pi)\,\mu^2$ will be very
little less than the right-hand side, and we lose little efficiency in testing for a change
in the mean $\mu$ of a normal distribution from zero. This is easily confirmed. The
large-sample distribution of 30.41 is the non-central $\chi^2$ with 1 degree of freedom and
non-central parameter $n(3/\pi)\mu^2$, equivalent to a standardized normal deviate of $(3n/\pi)^{1/2} \mu$.
The best test, based on the sample mean, uses a normal deviate of $n^{1/2} \mu$. The factor

$$(3/\pi)^{1/2} = 0.98$$
will make little difference to the power in large samples.
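The figures in Example 30.5 can be checked by direct numerical integration of (30.87) for $r = 1$, written in terms of $x$ as $\theta_1 = 2 \cdot 3^{1/2} \int \{F_0(x) - \tfrac12\}\,h(x \mid \mu)\,dx$. The Python sketch below is an illustration under stated assumptions: the function names, the integration range of $\pm 12$ and the Simpson step are arbitrary choices, not part of the text.

```python
import math

def std_normal_cdf(x):
    """F_0(x): d.f. of the standardized normal distribution tested by H_0."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pdf(x, mu):
    """h(x | mu): normal frequency function with mean mu and unit variance."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def theta1(mu, half_range=12.0, m=4000):
    """theta_1 = 2*sqrt(3) * integral of {F_0(x) - 1/2} h(x | mu) dx, by Simpson's rule."""
    a, b = -half_range, half_range
    h = (b - a) / m

    def f(x):
        return (std_normal_cdf(x) - 0.5) * normal_pdf(x, mu)

    total = f(a) + f(b)
    for j in range(1, m):
        total += (4.0 if j % 2 else 2.0) * f(a + j * h)
    return 2.0 * math.sqrt(3.0) * total * h / 3.0

slope_at_zero = theta1(1e-3) / 1e-3   # numerical d(theta_1)/d(mu) at mu = 0
mu = 0.5
captured = theta1(mu) ** 2            # theta_1^2, the part of lambda/n used by the p_1^2 test
available = math.exp(mu ** 2) - 1.0   # the bound (30.90)
```

`slope_at_zero` should agree with $(3/\pi)^{1/2}$ to several decimal places, and `captured` should fall below `available`, the shortfall growing as $\mu$ moves away from zero.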


The advantage of the $p_k^2$ tests displayed here is that, given the alternative, we can
choose $k$ to give higher power than with the $X^2$ test.


30.43 Since 30.37, we have confined ourselves to the simple hypothesis and
ungrouped observations. We now turn to discussion of the grouping of data for the $p_k^2$
tests (which is in practice very necessary in view of the need for carrying out the
probability integral transformation on every observation) and the extension of the tests to
composite hypotheses, which is perhaps even more important. These subjects form
the substance of the second and third of Barton's (1953-6) papers. The remarkable
fact is that, once grouping has been carried out, the $p_k^2$ tests move into intimate
relationship with the $X^2$ test.

Suppose that the range of the variate $x$ is grouped into $k$ classes, and let $\xi_i$ be the
median of the $i$th group from below. An obvious analogue of (30.73) is then

$$y_i' = \int_{-\infty}^{\xi_i} f_0(u)\,du = \sum_{j=1}^{i-1} p_{0j} + \tfrac12 p_{0i}, \tag{30.91}$$




where the $p_{0i}$ are hypothetical probabilities as before. We take all the $y_i$ in a class
to be replaced by the value $y_i'$, and write $z_i' = y_i' - \tfrac12$ as before. We now require a set
of orthogonal polynomials $P_r(y')$ which will play the same role for grouped data as
the standardized Legendre polynomials did in the ungrouped case. In view of the
fact that the alternative hypothesis may now be formulated in terms of the variable $x$ as

$$f(x \mid H_1) = \left\{\sum_{r=0}^{k-1} \theta_r P_r(y)\right\} f(x \mid H_0),$$

we have after grouping into $k$ classes the alternative hypothetical probabilities expressed
in terms of those tested by

$$p_{1i} = \left\{\sum_{r=0}^{k-1} \theta_r P_r(y_i')\right\} p_{0i}. \tag{30.92}$$
It is therefore natural to specify the $P_r(y')$ by
$$P_0(y') = 1, \tag{30.93}$$

$$\sum_{i=1}^{k} p_{0i}\,P_r(y_i')\,P_s(y_i') = \begin{cases} 1, & r = s,\\ 0, & r \ne s, \end{cases} \qquad r, s = 0, 1, 2, \ldots, k-1. \tag{30.94}$$

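Then (30.80) (with $k - 1$ replacing $k$) becomes in grouped form, on using (30.91),

$$p_{k-1}^2 = \frac{1}{n} \sum_{r=1}^{k-1} \left\{\sum_{i=1}^{n} P_r(y_i)\right\}^2 = \frac{1}{n} \sum_{r=1}^{k-1} \left\{\sum_{i=1}^{k} n_i P_r(y_i')\right\}^2,$$
which by (30.93) is
$$= \frac{1}{n} \sum_{r=0}^{k-1} \left\{\sum_{i=1}^{k} n_i P_r(y_i')\right\}^2 - n = \sum_{r=0}^{k-1} \left\{\sum_{i=1}^{k} \frac{n_i}{(n p_{0i})^{1/2}}\,p_{0i}^{1/2} P_r(y_i')\right\}^2 - n.$$

The summation is of the squares of weighted sums of the $p_{0i}^{1/2} P_r(y_i')$ (which are
orthogonal by (30.94)), with weights $n_i/(n p_{0i})^{1/2}$. We therefore have

$$p_{k-1}^2 = \sum_{i=1}^{k} \frac{n_i^2}{n p_{0i}} - n,$$
or, in virtue of (30.8),
$$p_{k-1}^2 = X^2 \tag{30.95}$$

exactly.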
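The identity (30.95) can be verified numerically for any set of class frequencies. In the Python sketch below, the polynomials $P_r$ satisfying (30.93) and (30.94) are built by Gram-Schmidt orthogonalization of $1, y, y^2, \ldots$ at the points $y_i'$ of (30.91); this construction, the function names and the frequencies used are illustrative assumptions rather than anything prescribed in the text.

```python
import math

def grouped_p2(counts, p0):
    """Grouped p^2_{k-1} for class frequencies n_i and hypothetical probabilities p_{0i}."""
    k, n = len(counts), sum(counts)
    # Points y'_i of (30.91): cumulative probability up to the middle of each class.
    yprime, cum = [], 0.0
    for p in p0:
        yprime.append(cum + 0.5 * p)
        cum += p
    # Gram-Schmidt on 1, y, y^2, ... under the inner product sum_i p0_i f(y'_i) g(y'_i).
    basis = []
    for r in range(k):
        v = [y ** r for y in yprime]
        for b in basis:
            proj = sum(p * vi * bi for p, vi, bi in zip(p0, v, b))
            v = [vi - proj * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(p * vi * vi for p, vi in zip(p0, v)))
        basis.append([vi / norm for vi in v])
    # basis[r][i] plays the role of P_r(y'_i); the r = 0 term is omitted as in the text.
    return sum(sum(ni * pr for ni, pr in zip(counts, basis[r])) ** 2
               for r in range(1, k)) / n

def pearson_x2(counts, p0):
    """The usual X^2 statistic, sum of (n_i - n p_0i)^2 / (n p_0i)."""
    n = sum(counts)
    return sum((ni - n * p) ** 2 / (n * p) for ni, p in zip(counts, p0))

counts = [18, 30, 27, 25]      # hypothetical class frequencies
p0 = [0.2, 0.3, 0.3, 0.2]      # hypothetical class probabilities on H_0
```

With these frequencies both functions should return the same value, apart from rounding error.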


30.44 Just as $p_{k-1}^2$ is identical with $X^2$ by (30.95), the lower-order tests $p_r^2$
($r = 1, 2, \ldots, k-2$) can now be seen to be components or partitions of $X^2$, particular
functions of the asymptotically standardized normal variates

$$x_i = \frac{n_i - n p_{0i}}{(n p_{0i})^{1/2}}$$
which we now discuss. If we write
$$u_r = \sum_{i=1}^{k} l_{ri}\,x_i, \qquad r = 1, 2, \ldots, k, \tag{30.96}$$

and choose the last row of the matrix $L = \{l_{ri}\}$ to be

$$l_{ki} = p_{0i}^{1/2},$$
we have identically
$$u_k = 0,$$

and if we choose the other elements of $L$ so that it is orthogonal, i.e.

$$\sum_{i=1}^{k} l_{ri}\,l_{si} = \begin{cases} 1, & r = s,\\ 0, & r \ne s, \end{cases}$$

the $u_r$ will also be asymptotically standardized normal variates and, since they are
orthogonal, asymptotically independent. Thus the sum of squares of any $m$ ($m = 1, 2, \ldots,
k-1$) of the $u_r$ ($r = 1, 2, \ldots, k-1$) will be distributed like $\chi^2$ with $m$ degrees of freedom,
independently of any other sum based on different $u_r$. We shall return to $X^2$
partitioning problems in a particular context in Chapter 33.

From this point of view, the virtue of the $p_1^2, p_2^2, \ldots$ tests, where the data are grouped,
is that they select the appropriate functions of the $y_i$ for the test in hand to have
maximum power; they isolate the important components of $X^2$.


30.45 With the remarks of 30.44 in mind, it is not surprising that when we come
to consider the composite hypothesis, the theory of the grouped $p_k^2$ tests closely resembles
that of the $X^2$ test already discussed in 30.11-21. All the principal results carry over,
as Barton showed (cf. Watson, 1959): if multinomial ML estimators are used, degrees
of freedom are reduced by one for each parameter estimated; if the grouping is
determined from the observations, this makes no difference (under regularity conditions)
to the asymptotic distributions.

The main problem in the application of the $p_k^2$ tests to the composite hypothesis is
that of choosing $k$. As we remarked in the simple hypothesis case, one often has
no very precise alternative in mind in making a test of fit; otherwise one would, if
possible, use a more specific test. In view of the fact that large samples are frequently
used for tests of fit, so that grouping of observations is a practical necessity, the identity
of the grouped $p_{k-1}^2$ test with the $X^2$ test means that, apart from partitioning problems,
which are common to both types of test, there is no competition between them.


Tests of fit based on the sample distribution function 


30.46 The remaining general tests of fit are all functions of the cumulative
distribution of the sample, or sample distribution function, defined by

$$S_n(x) = \begin{cases} 0, & x < x_{(1)},\\ r/n, & x_{(r)} \le x < x_{(r+1)},\\ 1, & x_{(n)} \le x. \end{cases} \tag{30.97}$$

The $x_{(r)}$ are the order-statistics, i.e. the observations arranged so that
$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}.$$

$S_n(x)$ is simply the proportion of the observations not exceeding $x$. If $F_0(x)$ is the
true d.f., fully specified, from which the observations come, we have, for each value of
$x$, from the Strong Law of Large Numbers,

$$\lim_{n \to \infty} P\{S_n(x) = F_0(x)\} = 1, \tag{30.98}$$

and in fact stronger results are available concerning the convergence of the sample
d.f. to the true d.f.

In a sense, (30.98) is the fundamental relationship on which all statistical theory is
based. If something like it did not hold, there would be no point in random sampling.
In our present context, it is clear that a test of fit can be based on any measure of
divergence of $S_n(x)$ and $F_0(x)$. We now suppose $F_0(x)$ to be continuous. Consider the
test statistic

$$W^2 = \int_{-\infty}^{\infty} \{S_n(x) - F_0(x)\}^2\,dF_0(x), \tag{30.99}$$


which was proposed by Smirnov (1936) after earlier suggestions by H. Cramér and
R. von Mises. Now, from binomial theory (Example 3.2) with $p = F_0(x)$,

$$E\{S_n(x) - F_0(x)\}^2 = F_0(x)\{1 - F_0(x)\}/n. \tag{30.100}$$
Thus we have from (30.99) and (30.100)
$$E(W^2) = \frac{1}{n} \int_0^1 F_0 (1 - F_0)\,dF_0 = \frac{1}{n}\left(\frac12 - \frac13\right) = \frac{1}{6n}, \tag{30.101}$$

and similarly it may be established that
$$\operatorname{var}(W^2) = E(W^4) - E^2(W^2) = \frac{4n - 3}{180\,n^3}. \tag{30.102}$$

30.47 It will be noticed that the mean and variance of $W^2$ do not depend on $F_0$.
In fact, the distribution of $W^2$ as a whole does not depend on $F_0$: the test is completely
distribution-free for any $n$. This is easily seen directly, for if we apply the
probability integral transformation (30.73) to $x$, we reduce (30.99) to

$$W^2 = \int_0^1 \{S_n(y) - y\}^2\,dy, \tag{30.103}$$

i.e. we have reduced the problem of fit to testing whether, in a sample from the
rectangular distribution on (0, 1), the sample departs too far from the d.f. of that
distribution, $F_0(y) = y$.

From (30.101) and (30.102), it will be clear that the limiting distribution which
must be sought is that of $nW^2$ (rather than the multiple $n^{1/2}$ which is commonly necessary
because of the Central Limit theorem), which will have mean and variance
asymptotically of order zero in $n$. The asymptotic theory of $nW^2$ is difficult, and the exact
theory for finite $n$ is unknown. Smirnov (1936) showed that its limiting c.f. is

$$\phi(t) = \lim_{n \to \infty} E\{\exp(it\,nW^2)\} = \left\{\frac{(2it)^{1/2}}{\sin[(2it)^{1/2}]}\right\}^{1/2}. \tag{30.104}$$
Anderson and Darling (1952) inverted $\phi(t)$ into a form suitable for numerical
calculation, and tabulated the limiting distribution of $nW^2$ in inverse form, giving the values
exceeded with probabilities 0.001, 0.01 (0.01) 0.99. Conventionally, the most important
of their values for test purposes are:

    Test size $\alpha$      Critical value of $nW^2$

        0.10                  0.347
        0.05                  0.461
        0.01                  0.743
        0.001                 1.168

Large values of $nW^2$ form the critical region, as is evident from the motivation of the
test.

Marshall (1958) showed that the asymptotic distribution of $nW^2$ is reached
remarkably rapidly, the asymptotic critical values given above being adequate for $n$ as low as 3.

E. S. Pearson and Stephens (1962) fit Johnson Type $S_B$ distributions (cf. 6.27-34)
for $n = 5, 10, \infty$, to obtain critical values.


30.48 For the $W^2$ test, as for the ungrouped $p_k^2$ tests discussed in 30.37-42, one
needs to calculate $F_0(x)$ for each individual observation. In fact, it may be shown
that we may express the statistic as

$$nW^2 = \frac{1}{12n} + \sum_{r=1}^{n} \left\{F_0(x_{(r)}) - \frac{2r - 1}{2n}\right\}^2. \tag{30.105}$$
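The computing form (30.105) translates directly into code. The Python sketch below is illustrative only: the function names are hypothetical, the hypothetical d.f. is passed in as a function `cdf`, and the critical values are the asymptotic ones tabulated in 30.47.

```python
def n_w2(sample, cdf):
    """n W^2 from (30.105), with cdf the fully specified hypothetical d.f. F_0."""
    n = len(sample)
    u = sorted(cdf(x) for x in sample)      # F_0 evaluated at the order-statistics
    return 1.0 / (12.0 * n) + sum(
        (u[r - 1] - (2.0 * r - 1.0) / (2.0 * n)) ** 2 for r in range(1, n + 1)
    )

# Asymptotic critical values of n W^2 quoted in 30.47, keyed by test size.
CRITICAL_NW2 = {0.10: 0.347, 0.05: 0.461, 0.01: 0.743, 0.001: 1.168}

def reject(sample, cdf, alpha=0.05):
    """Large values of n W^2 form the critical region."""
    return n_w2(sample, cdf) > CRITICAL_NW2[alpha]

def uniform_cdf(x):
    """d.f. of the rectangular distribution on (0, 1), used here as F_0."""
    return min(1.0, max(0.0, x))

evenly_spread = [0.1, 0.3, 0.5, 0.7, 0.9]   # F_0(x_(r)) = (2r - 1)/(2n) exactly
crowded = [0.99] * 10                       # all observations near the upper end
```

For `evenly_spread` the sum in (30.105) vanishes, leaving the minimum possible value $1/(12n)$; `crowded` gives a value far inside the critical region.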


The $W^2$ test has been investigated for the composite hypothesis, with one parameter
unspecified, by Darling (1955). The test statistic is now no longer distribution-free in
general, as we should expect from the discussion of 30.36; the exception is when the
parameter can be estimated with variance of order less than $n^{-1}$, when the limiting
distribution is just as in the simple hypothesis case (cf. 30.8, where we met the same phenomenon
for $X^2$). If the parameter is of location or scale, estimated with variance of order $n^{-1}$,
the limiting distribution is parameter-free (cf. 30.36).

Anderson and Darling (1952, 1954) investigated an alternative test statistic for the
simple hypothesis. It is simply (30.99) with the factor $[F_0(x)\{1 - F_0(x)\}]^{-1}$ inserted
in the integrand. They tabulate critical values of its asymptotic distribution for
$\alpha$ = 0.10, 0.05, 0.01. P. A. W. Lewis (1961) gives an exact tabulation of the d.f. for
$n = 1$ and $n \to \infty$ and estimates based on sampling experiments for $n$ = 2 (1) 8; the
convergence to the asymptotic d.f. is extremely rapid.

Watson (1961) shows that if we modify the $nW^2$ statistic to

$$U^2 = n \int_0^1 \left[S_n(x) - F_0(x) - \int_0^1 \{S_n(y) - F_0(y)\}\,dF_0(y)\right]^2 dF_0(x),$$

the asymptotic distribution of $\pi^2 U^2$ is exactly that of $nD_n^2$ given at (30.132) below. E. S.
Pearson and Stephens (1962) and Stephens (1963, 1964) give theoretical and empirical
results on the distribution of $U^2$. Tiku (1965b) fits $\chi^2$ approximations for $U^2$ and also
for $W^2$.


The Kolmogorov statistic 


30.49 We now come to the most important of the general tests of fit alternative
to $X^2$. Like $W^2$, defined at (30.99), it is based on deviations of the sample d.f. $S_n(x)$
from the completely specified continuous hypothetical d.f. $F_0(x)$. The measure of




deviation used, however, is very much simpler, being the maximum absolute difference
between $S_n(x)$ and $F_0(x)$. Thus we define

$$D_n = \sup_x |S_n(x) - F_0(x)|. \tag{30.106}$$


The appearance of the modulus in the definition (30.106) might lead us to expect
difficulties in the investigation of the distribution of $D_n$, but remarkably enough, the
asymptotic distribution was obtained by Kolmogorov (1933) when he first proposed
the statistic. The derivation which follows is due to Feller (1948).


30.50 We first note that the distribution of $D_n$ is completely distribution-free
when $H_0$ holds. We may see this very directly in this case, for if $S_n(x)$ and $F_0(x)$
are plotted as ordinates against $x$ as abscissa, $D_n$ is simply the value of the largest vertical
difference between them. Clearly, if we make any one-to-one transformation of $x$,
this will not affect the vertical difference at any point and, in particular, the value of
$D_n$ will be unaffected.
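Since $S_n(x)$ is a step-function and $F_0(x)$ is non-decreasing, the supremum in (30.106) must be approached at an observed value, either at the top of a step or just below it. The Python sketch below computes $D_n$ on that basis and illustrates the invariance just described; the function name, the sample, and the particular transformation are illustrative assumptions.

```python
import math

def kolmogorov_d(sample, cdf):
    """D_n of (30.106) for a continuous, fully specified hypothetical d.f. cdf."""
    n = len(sample)
    u = sorted(cdf(x) for x in sample)
    # S_n above F_0: largest excess at the top of each step.
    above = max((i + 1) / n - u[i] for i in range(n))
    # F_0 above S_n: largest excess just before each step.
    below = max(u[i] - i / n for i in range(n))
    return max(above, below)

x = [0.25, 0.75, 0.40, 0.90]

# H_0: x is rectangular on (0, 1), so F_0(x) = x.
d_original = kolmogorov_d(x, lambda v: v)

# One-to-one transformation w = exp(x); on H_0 the d.f. of w is log(w) on (1, e).
d_transformed = kolmogorov_d([math.exp(v) for v in x], lambda w: math.log(w))
```

The two values agree apart from rounding error, as the argument of 30.50 requires.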


30.51 Now consider the values $x_{10}, x_{20}, \ldots, x_{n-1,0}$ defined by
$$F_0(x_{k0}) = k/n. \tag{30.107}$$
(If, for some $k$, (30.107) holds within an interval, we take $x_{k0}$ to be the lower end-point
of the interval.) Let $c$ be a positive integer. If, for some value $x$,

$$S_n(x) - F_0(x) > c/n, \tag{30.108}$$
the inequality (30.108) will hold for all values of $x$ in some interval at whose upper
end-point $x'$ it becomes an equality, i.e.

$$S_n(x') - F_0(x') = c/n. \tag{30.109}$$
Since $S_n(x)$ is by definition a step-function taking values which are multiples of $1/n$,
and $c$ is an integer, it follows from (30.109) that $F_0(x')$ is a multiple of $1/n$ and thus,
from (30.107), $x' = x_{k0}$ for some $k$, so that (30.109) becomes

$$S_n(x_{k0}) - F_0(x_{k0}) = c/n,$$

i.e. from (30.107),

$$S_n(x_{k0}) = (k + c)/n. \tag{30.110}$$
From the definition of $S_n(x)$ at (30.97), this means that exactly $(k + c)$ of the observed
values of $x$ are less than $x_{k0}$, the hypothetical value below which $k$ of them should fall.
Conversely, if $x_{(k+c)} \le x_{k0} < x_{(k+c+1)}$, (30.108) will follow immediately. We have
therefore established the preliminary result that the inequality

$$S_n(x) - F_0(x) \ge c/n$$
holds for some $x$ if and only if for some $k$

$$x_{(k+c)} \le x_{k0} < x_{(k+c+1)}. \tag{30.111}$$

We may therefore confine ourselves to consideration of the probability that (30.111)
occurs.


30.52 We denote the event (30.111) by $A_k(c)$. From (30.106), we see that the
statistic $D_n$ will exceed $c/n$ if and only if at least one of the $2n$ events

$$A_1(c),\ A_1(-c),\ A_2(c),\ A_2(-c),\ \ldots,\ A_n(c),\ A_n(-c) \tag{30.112}$$
occurs. We now define the $2n$ mutually exclusive events $U_r$ and $V_r$. $U_r$ occurs if
$A_r(c)$ is the first event in the sequence (30.112) to occur, and $V_r$ occurs if $A_r(-c)$ is
the first. Evidently

$$P\left\{D_n > \frac{c}{n}\right\} = \sum_{r=1}^{n} [P\{U_r\} + P\{V_r\}]. \tag{30.113}$$

We have, from the definitions of $A_k(c)$ and $U_r$, $V_r$, the relations
$$\begin{aligned}
P\{A_k(c)\} &= \sum_{r=1}^{k} [P\{U_r\}\,P\{A_k(c) \mid A_r(c)\} + P\{V_r\}\,P\{A_k(c) \mid A_r(-c)\}],\\
P\{A_k(-c)\} &= \sum_{r=1}^{k} [P\{U_r\}\,P\{A_k(-c) \mid A_r(c)\} + P\{V_r\}\,P\{A_k(-c) \mid A_r(-c)\}].
\end{aligned} \tag{30.114}$$

From (30.111) and (30.107), we see that $P\{A_k(c)\}$ is the probability that exactly
$(k + c)$ "successes" occur in $n$ binomial trials with probability $k/n$, i.e.,

$$P\{A_k(c)\} = \binom{n}{k+c} \left(\frac{k}{n}\right)^{k+c} \left(1 - \frac{k}{n}\right)^{n-k-c}. \tag{30.115}$$
Similarly, for $r < k$,
$$\begin{aligned}
P\{A_k(c) \mid A_r(c)\} &= \binom{n-r-c}{k-r} \left(\frac{k-r}{n-r}\right)^{k-r} \left(1 - \frac{k-r}{n-r}\right)^{n-k-c},\\
P\{A_k(c) \mid A_r(-c)\} &= \binom{n-r+c}{k-r+2c} \left(\frac{k-r}{n-r}\right)^{k-r+2c} \left(1 - \frac{k-r}{n-r}\right)^{n-k-c}.
\end{aligned} \tag{30.116}$$

(30.115) and (30.116) hold for negative as well as positive $c$. Using them, we see that
(30.114) is a set of $2n$ linear equations for the $2n$ unknowns $P\{U_r\}$, $P\{V_r\}$. If we
solved these, and substituted into (30.113), we should obtain $P\{D_n > c/n\}$ for any $c$.


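30.53 If we now write
$$p_k(c) = e^{-k}\,\frac{k^{k+c}}{(k+c)!}, \tag{30.117}$$

we have

$$\begin{aligned}
P\{A_k(c)\} &= p_k(c)\,p_{n-k}(-c)/p_n(0),\\
P\{A_k(c) \mid A_r(c)\} &= p_{k-r}(0)\,p_{n-k}(-c)/p_{n-r}(-c),\\
P\{A_k(c) \mid A_r(-c)\} &= p_{k-r}(2c)\,p_{n-k}(-c)/p_{n-r}(c),
\end{aligned} \tag{30.118}$$

so that if we define

$$u_r = P\{U_r\}\,\frac{p_n(0)}{p_{n-r}(-c)}, \qquad v_r = P\{V_r\}\,\frac{p_n(0)}{p_{n-r}(c)}, \tag{30.119}$$
and substitute (30.115-19) into (30.114), the latter becomes simply

$$\begin{aligned}
p_k(c) &= \sum_{r=1}^{k} [u_r\,p_{k-r}(0) + v_r\,p_{k-r}(2c)],\\
p_k(-c) &= \sum_{r=1}^{k} [u_r\,p_{k-r}(-2c) + v_r\,p_{k-r}(0)].
\end{aligned} \tag{30.120}$$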
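The system (30.120) is to be solved for

$$\sum_{r=1}^{n} [P\{U_r\} + P\{V_r\}] = \frac{1}{p_n(0)} \sum_{r=1}^{n} \{p_{n-r}(-c)\,u_r + p_{n-r}(c)\,v_r\}. \tag{30.121}$$
We therefore define
$$p_k = \frac{1}{p_n(0)} \sum_{r=1}^{k} p_{k-r}(-c)\,u_r, \qquad q_k = \frac{1}{p_n(0)} \sum_{r=1}^{k} p_{k-r}(c)\,v_r, \tag{30.122}$$
so that, from (30.121),
$$\sum_{r=1}^{n} [P\{U_r\} + P\{V_r\}] = p_n + q_n. \tag{30.123}$$
We now set up generating functions for the $p_k$ and $q_k$, namely
$$G_p(t) = \sum_{k=1}^{\infty} p_k t^k, \qquad G_q(t) = \sum_{k=1}^{\infty} q_k t^k.$$
If we also define generating functions for the $u_k$, $v_k$ and (for convenience) $n^{-1/2} p_k(c)$,
namely
$$G_u(t) = \sum_{k=1}^{\infty} u_k t^k, \qquad G_v(t) = \sum_{k=1}^{\infty} v_k t^k,$$
and
$$G(t, c) = n^{-1/2} \sum_{k=1}^{\infty} p_k(c)\,t^k,$$
we have from (30.122), the relationships

$$\begin{aligned}
G_p(t) &= G_u(t)\,G(t, -c)\,n^{1/2}/p_n(0),\\
G_q(t) &= G_v(t)\,G(t, c)\,n^{1/2}/p_n(0).
\end{aligned} \tag{30.124}$$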


30.54 We now consider the limiting form of (30.124). We put
$$c = z n^{1/2}$$
and let $n \to \infty$ and $c \to \infty$ with it so that $z$ remains fixed.
We see from (30.117) that $p_k(c)$ is simply the probability of the value $(k + c)$ for
a Poisson variate with parameter $k$, i.e. the probability of its being $c/k^{1/2}$ standard
deviations above its mean. If $k/n$ tends to some fixed value $m$, then as the Poisson variate
tends to normality
$$p_k(c) \to (2\pi k)^{-1/2} \exp\left(-\frac{c^2}{2k}\right)$$
or, putting $k = mn$, $c = z n^{1/2}$,
$$n^{1/2}\,p_k(z n^{1/2}) \to (2\pi m)^{-1/2} \exp\left(-\frac{z^2}{2m}\right). \tag{30.125}$$

Now since $G(t, c)$ is a generating function for the $n^{-1/2} p_k(c)$, we have

$$G(e^{-t/n}, z n^{1/2}) = n^{-1/2} \sum_{k=1}^{\infty} p_k(z n^{1/2})\,e^{-kt/n}$$

and under our limiting process this tends by (30.125) to

$$\lim_{n \to \infty} G(e^{-t/n}, z n^{1/2}) = (2\pi)^{-1/2} \int_0^{\infty} m^{-1/2} \exp\left(-tm - \tfrac12 \frac{z^2}{m}\right) dm. \tag{30.126}$$
If we differentiate the integral $I$ on the right of (30.126) with respect to $\tfrac12 z^2$, we find
the simple differential equation

$$\frac{\partial I}{\partial(\tfrac12 z^2)} = -\left(\frac{2t}{z^2}\right)^{1/2} I,$$

whose solution is

$$I = \left(\frac{\pi}{t}\right)^{1/2} \exp\{-(2t z^2)^{1/2}\}.$$

Thus

$$\lim_{n \to \infty} G(e^{-t/n}, z n^{1/2}) = (2t)^{-1/2} \exp\{-(2t z^2)^{1/2}\}. \tag{30.127}$$
(30.127) is an even function of $z$, and therefore of $c$.

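Since, from (30.120),

$$\begin{aligned}
G(t, c) &= G_u(t)\,G(t, 0) + G_v(t)\,G(t, 2c),\\
G(t, -c) &= G_u(t)\,G(t, -2c) + G_v(t)\,G(t, 0),
\end{aligned} \tag{30.128}$$

this evenness of (30.127) in $c$ gives us
$$\lim_{n \to \infty} G_u(e^{-t/n}) = \lim_{n \to \infty} G_v(e^{-t/n}) = \frac{\lim G(e^{-t/n}, z n^{1/2})}{\lim G(e^{-t/n}, 0) + \lim G(e^{-t/n}, 2z n^{1/2})} = \frac{\exp\{-(2t z^2)^{1/2}\}}{1 + \exp\{-(8t z^2)^{1/2}\}}, \tag{30.129}$$
by (30.127). Thus, in (30.124), remembering that
$$p_n(0) \sim (2\pi n)^{-1/2},$$

(30.127) and (30.129) give
$$\lim_{n \to \infty} n^{-1} G_p(e^{-t/n}) = \lim_{n \to \infty} n^{-1} G_q(e^{-t/n}) = \left(\frac{2\pi}{2t}\right)^{1/2} \frac{\exp\{-(8t z^2)^{1/2}\}}{1 + \exp\{-(8t z^2)^{1/2}\}} = L(t).$$
This may be expanded into geometric series as
$$L(t) = \left(\frac{\pi}{t}\right)^{1/2} \sum_{r=1}^{\infty} (-1)^{r-1} \exp\{-(8t r^2 z^2)^{1/2}\}. \tag{30.130}$$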


By the same integration as at (30.126), $L(t)$ is seen to be the one-sided Laplace
transform $\int_0^{\infty} e^{-mt} f(m)\,dm$ of the function

$$f(m) = m^{-1/2} \sum_{r=1}^{\infty} (-1)^{r-1} \exp\{-2r^2 z^2/m\}. \tag{30.131}$$
(30.131) is thus the result of inverting either of the limiting generating functions of
the $p_k$ or $q_k$, of which the first is
$$\lim n^{-1} G_p(e^{-t/n}) = \lim n^{-1} \sum_{k=1}^{\infty} p_k\,e^{-kt/n} = \int_0^{\infty} (\lim p_k)\,e^{-mt}\,dm.$$

From (30.113) and (30.123), we require only the value $(p_n + q_n)$. We thus put $k = n$,
i.e. $m = 1$, in (30.131) and after multiplying by two, obtain our final result

$$\lim_{n \to \infty} P\{D_n > z n^{-1/2}\} = 2 \sum_{r=1}^{\infty} (-1)^{r-1} \exp\{-2r^2 z^2\}. \tag{30.132}$$
Smirnov (1948) tabulates (30.132) (actually its complement) for $z$ = 0.28 (0.01)
2.50 (0.05) 3.00 to 6 d.p. or more. This is the whole effective range of the limiting
distribution.
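The series (30.132) converges very quickly except for small $z$, so that it is easily evaluated without recourse to tables. A minimal Python sketch follows; the function name and the fixed number of terms are illustrative choices.

```python
import math

def kolmogorov_tail(z, terms=100):
    """Limiting value of P{D_n > z n^(-1/2)} from the series (30.132), for z > 0."""
    return 2.0 * sum(
        (-1) ** (r - 1) * math.exp(-2.0 * r * r * z * z) for r in range(1, terms + 1)
    )

tail_at_1_3581 = kolmogorov_tail(1.3581)
tail_at_1_6276 = kolmogorov_tail(1.6276)
```

The two arguments used here are the coefficients of $n^{-1/2}$ quoted in 30.55 as the asymptotic critical values; the tail probabilities returned should be very close to 0.05 and 0.01 respectively.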


30.55 As well as deriving the limiting result (30.132), Kolmogorov (1933) gave
recurrence relations for finite $n$, which have since been used to tabulate the distribution of
$D_n$. Z. W. Birnbaum (1952) gives tables of $P\{D_n < c/n\}$ to 5 d.p., for $n$ = 1 (1) 100 and
$c$ = 1 (1) 15, and inverse tables of the values of $D_n$ for which this probability is 0.95 for
$n$ = 2 (1) 5 (5) 30 (10) 100 and for which the probability is 0.99 for $n$ = 2 (1) 5 (5) 30 (10) 80.
L. H. Miller (1956) gives inverse tables for $n$ = 1 (1) 100 and probabilities 0.90, 0.95, 0.98,
0.99. Massey (1950a, 1951a) had previously given $P\{D_n < c/n\}$ for $n$ = 5 (5) 80 and
selected values of $c \le 9$, and also inverse tables for $n$ = 1 (1) 20 (5) 35 and probabilities 0.80,
0.85, 0.90, 0.95, 0.99.

It emerges that the critical values of the asymptotic distribution are:

    Test size $\alpha$      Critical value of $D_n$
        0.05                  $1.3581\,n^{-1/2}$
        0.01                  $1.6276\,n^{-1/2}$

and that these are always greater than the exact values for finite $n$. The approximation
for these values of $\alpha$ is satisfactory at $n = 80$.


Confidence limits for distribution functions 

30.56 Because the distribution of $D_n$ is distribution-free and adequately known
for all $n$, and because it uses as its measure of divergence the maximum absolute
deviation between $S_n(x)$ and $F_0(x)$, we may reverse the procedure of testing for fit and use
$D_n$ to set confidence limits for a (continuous) distribution function as a whole. For,
whatever the true $F(x)$, we have, if $d_\alpha$ is the critical value of $D_n$ for test size $\alpha$,

$$P\{D_n = \sup_x |S_n(x) - F(x)| > d_\alpha\} = \alpha.$$

Thus we may invert this into the confidence statement

$$P\{S_n(x) - d_\alpha \le F(x) \le S_n(x) + d_\alpha, \text{ all } x\} = 1 - \alpha. \tag{30.133}$$
Thus we simply set up a band of width $\pm d_\alpha$ around the sample d.f. $S_n(x)$, and there
is probability $1 - \alpha$ that the true $F(x)$ lies entirely within this band. This is a
remarkably simple and direct method of estimating a distribution function. No other test
of fit permits this inversion of test into confidence interval since none uses so direct
and simply interpretable a measure of divergence as $D_n$.

One can draw useful conclusions from this confidence interval technique as to the
sample size necessary to approximate a d.f. closely. For example, from the critical
values given at the end of 30.55, it follows that a sample of 100 observations would have
probability 0.95 of having its sample d.f. everywhere within 0.13581 of the true d.f.
To be within 0.05 of the true d.f. everywhere, with probability 0.99, would require
a sample size of $(1.6276/0.05)^2$, i.e. more than 1000.
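The arithmetic of these two examples, and the construction of the band itself, can be sketched in Python as follows. The sketch uses the asymptotic coefficients quoted at the end of 30.55 (which, as noted there, exceed the exact values for finite $n$); the function names are hypothetical, and clipping the band to the interval (0, 1) is an added convenience, justified only by the fact that a d.f. cannot lie outside that interval.

```python
import math

# Asymptotic coefficients of n^(-1/2) from 30.55, keyed by test size alpha.
COEFF = {0.05: 1.3581, 0.01: 1.6276}

def band_half_width(n, alpha=0.05):
    """Approximate d_alpha for a sample of size n."""
    return COEFF[alpha] / math.sqrt(n)

def sample_size_for(half_width, alpha=0.05):
    """Smallest n for which the approximate d_alpha does not exceed half_width."""
    return math.ceil((COEFF[alpha] / half_width) ** 2)

def confidence_band(sample, alpha=0.05):
    """(x, lower, upper) at each ordered observation, clipped to [0, 1]."""
    n = len(sample)
    d = band_half_width(n, alpha)
    band = []
    for r, x in enumerate(sorted(sample), start=1):
        s = r / n                       # S_n at the r-th order-statistic
        band.append((x, max(0.0, s - d), min(1.0, s + d)))
    return band

width_100 = band_half_width(100, 0.05)
needed = sample_size_for(0.05, 0.01)
band = confidence_band([2.1, 0.4, 1.7, 3.3, 0.9], alpha=0.05)
```

Between consecutive order-statistics the band is constant, since $S_n(x)$ is a step-function.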


Noether (1963) shows that the left side of (30.133) holds with probability $\ge 1 - \alpha$ for
discrete distributions. Thus the $D_n$ test is then also conservative.


30.57 Because it is a modular quantity, $D_n$ does not permit us to set one-sided
confidence intervals for $F(x)$, but we may consider positive deviations only and define
$$D_n^+ = \sup_x \{S_n(x) - F_0(x)\} \tag{30.134}$$

as was done by Wald and Wolfowitz (1939) and Smirnov (1939a).

To obtain the limiting distribution of $D_n^+$, we retrace the argument of 30.51-54.
We now consider only events $A_k(c)$ with $c > 0$ in (30.112). $U_r$ is defined as before,
but $V_r$ is not considered. (30.114) is replaced by

$$P\{A_k(c)\} = \sum_{r=1}^{k} P\{U_r\}\,P\{A_k(c) \mid A_r(c)\}$$

and (30.128) by
$$G(t, c) = G_u(t)\,G(t, 0). \tag{30.135}$$
Instead of (30.129), we therefore have, using (30.127) and (30.135),
$$\lim_{n \to \infty} G_u(e^{-t/n}) = \exp\{-(2t z^2)^{1/2}\}.$$

The first equation in (30.124) holds, and we get, in the same way as before,
$$\lim_{n \to \infty} n^{-1} G_p(e^{-t/n}) = \left(\frac{2\pi}{2t}\right)^{1/2} \exp\{-(8t z^2)^{1/2}\}. \tag{30.136}$$

Again from (30.127), (30.136) is seen to be the one-sided Laplace transform of
$$f(m) = m^{-1/2} \exp(-2z^2/m)$$
and substitution of $m = 1$ as before gives

$$\lim_{n \to \infty} P\{D_n^+ > z n^{-1/2}\} = \exp(-2z^2), \tag{30.137}$$
which is Smirnov's (1939a) result. (30.137) may be rewritten
$$\lim_{n \to \infty} P\{2n (D_n^+)^2 \le 2z^2\} = 1 - \exp(-2z^2). \tag{30.138}$$

Differentiation of (30.138) with respect to $(2z^2)$ shows that the variable $y = 2n (D_n^+)^2$ is
asymptotically distributed in the negative exponential form

$$dF(y) = \exp(-y)\,dy, \qquad 0 \le y < \infty.$$
Alternatively, we may express this by saying that $2y = 4n (D_n^+)^2$ is asymptotically a $\chi^2$
variate with 2 degrees of freedom. Evidently, exactly the same theory will hold if we
consider only negative deviations.
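Because (30.137) is a single exponential, the asymptotic one-sided critical values are available in closed form: setting $\exp(-2z^2) = \alpha$ gives $z = (-\tfrac12 \log \alpha)^{1/2}$. The short Python sketch below (function names illustrative) evaluates this and confirms the equivalent statement in terms of $\chi^2$ with 2 degrees of freedom, whose upper tail probability at $x$ is $\exp(-\tfrac12 x)$.

```python
import math

def one_sided_z(alpha):
    """z solving exp(-2 z^2) = alpha, from (30.137)."""
    return math.sqrt(-0.5 * math.log(alpha))

def one_sided_critical_value(n, alpha):
    """Asymptotic critical value of the one-sided statistic: z * n^(-1/2)."""
    return one_sided_z(alpha) / math.sqrt(n)

def chi2_2df_upper_tail(x):
    """P{chi-squared with 2 degrees of freedom > x}."""
    return math.exp(-0.5 * x)

z_05 = one_sided_z(0.05)
d_plus_50 = one_sided_critical_value(50, 0.05)
```

These are asymptotic values only; for small $n$ the exact tabulations referred to in 30.58 should be preferred.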


30.58 Z. W. Birnbaum and Tingey (1951) give an expression for the exact distribution
of $D_n^+$, and tabulate the values it exceeds with probabilities 0.10, 0.05, 0.01, 0.001, for
$n$ = 5, 8, 10, 20, 40, 50. As for $D_n$, the asymptotic values exceed the exact values, and
the differences are small for $n = 50$.
We may evidently use $D_n^+$ to obtain one-sided confidence regions of the form
$P\{S_n(x) - d_\alpha^+ \le F(x)\} = 1 - \alpha$, where $d_\alpha^+$ is the critical value of $D_n^+$.


Comparison of Kolmogorov’s statistic with X? 

30.59 Nothing is known in general of the behaviour of the $D_n$ statistic when
parameters are to be estimated in testing a composite hypothesis of fit, although its use in
testing normality has been studied (cf. 30.63). It will clearly not remain
distribution-free under these circumstances (cf. 30.36), and this represents a substantial disadvantage
compared with the $X^2$ test. However, it has the advantage of permitting the setting
of confidence intervals for the parent d.f., given only that the latter is continuous.
Because of the strong convergence of $S_n(x)$ to the true d.f. $F(x)$ (cf. (30.98)), the
$D_n$ test is consistent against any alternative $G(x) \ne F(x)$. However, Massey (1950b,
1952) has given an example in which it is biassed (cf. Exercise 30.16). He also
established a lower bound to the power of the test in large samples as follows.


30.60 Write $F_1(x)$ for the d.f. under the alternative hypothesis $H_1$, $F_0(x)$ for the
d.f. being tested as before; and

$$\Delta = \sup_x |F_1(x) - F_0(x)|. \tag{30.139}$$

If $d_\alpha$ is the critical value of $D_n$ as before, the power we require is
$$P = P\{\sup_x |S_n(x) - F_0(x)| > d_\alpha \mid H_1\}.$$

This is the probability of an inequality arising for some $x$. Clearly this is no less than
the probability that it occurs at any particular value of $x$. Let us choose a particular
value, $x_\Delta$, at which $F_0$ and $F_1$ are at their farthest apart, i.e.

$$\Delta = |F_1(x_\Delta) - F_0(x_\Delta)|. \tag{30.140}$$
Thus we have
$$P \ge P\{|S_n(x_\Delta) - F_0(x_\Delta)| > d_\alpha \mid H_1\}$$
or

$$P \ge 1 - P\{F_0(x_\Delta) - d_\alpha \le S_n(x_\Delta) \le F_0(x_\Delta) + d_\alpha \mid H_1\}. \tag{30.141}$$


Now, S,(x,) is binomially distributed with probability F(x.) of falling below x,. 
Thus we may approximate the right-hand side of (30.141) using the normal approxima- 
tion to the binomial distribution, i.e. asymptotically 


F,—F +d, 
{F,(1—F,)/n}3 


P > 1-(22)3 exp (—4u®) du, (30.142) 


F,—F,—d, 
{F,(1—F,)/n}2 
F,, and F, being evaluated at x, in (30.142) and hereafter. If F', is specified, (30.142) 


is the required lower bound for the power. Clearly, as » —> oo both limits of integra- 
tion increase. If 


Psi paps x (30.143) 


they will both tend to + o if Fy > F, and to —o if Fy < F,. Thus the integral 
will tend to zero and the power to 1. As 2 increases, d, declines, so (30.143) 1s always 
ultimately satisfied. Hence the power — 1 and the test is consistent. 

If $F_1$ is not completely specified, we may still obtain a (worse) lower bound to the
power from (30.142). Since $F_1(1-F_1) \leq \tfrac{1}{4}$, we have, for large enough $n$,

$$P \geq 1 - (2\pi)^{-1/2} \int_{2n^{1/2}(F_0 - F_1 - d_\alpha)}^{2n^{1/2}(F_0 - F_1 + d_\alpha)} \exp(-\tfrac{1}{2}u^2)\,du,$$

which, using the symmetry of the normal distribution if $F_0 < F_1$, we may write as

$$P \geq 1 - (2\pi)^{-1/2} \int_{2n^{1/2}(\Delta - d_\alpha)}^{2n^{1/2}(\Delta + d_\alpha)} \exp(-\tfrac{1}{2}u^2)\,du. \tag{30.144}$$

The bound (30.144) is in terms of the maximum deviation $\Delta$ alone.
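
The bound (30.144) is easy to evaluate numerically. The following is a minimal sketch, not part of the original text: the function name `dn_power_lower_bound` is hypothetical, and the usage lines assume the large-sample approximation $d_\alpha \approx 1.358/n^{1/2}$ for $\alpha = 0.05$ in place of exact tabulated critical values.

```python
from math import sqrt
from statistics import NormalDist


def dn_power_lower_bound(delta, n, d_alpha):
    """Large-sample lower bound (30.144) for the power of the D_n test.

    delta   : maximum deviation between the alternative and hypothesised d.f.
    n       : sample size
    d_alpha : critical value of D_n for the chosen test size
    """
    phi = NormalDist().cdf  # standard normal d.f.
    upper = 2.0 * sqrt(n) * (delta + d_alpha)
    lower = 2.0 * sqrt(n) * (delta - d_alpha)
    return 1.0 - (phi(upper) - phi(lower))


# Illustration only: asymptotic 5 per cent critical value (an assumption).
def d_approx(n):
    return 1.358 / sqrt(n)


bound_50 = dn_power_lower_bound(0.2, 50, d_approx(50))
bound_100 = dn_power_lower_bound(0.2, 100, d_approx(100))
bound_at_d = dn_power_lower_bound(d_approx(100), 100, d_approx(100))
```

When $\Delta = d_\alpha$ the lower limit of integration is zero and the bound is very nearly 0.5, which is the situation underlying comparisons of the $\Delta$ detectable with power 0.5.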


Z. W. Birnbaum (1953) obtained sharp upper and lower bounds for the power of
$D_n^{+}$ in terms of $\Delta$.


30.61 Using (30.144) and calculations made by Williams (1950), Massey (1951a)
compared the values of $\Delta$ for which the large-sample powers of the $X^2$ and the $D_n$ tests
are at least 0.5. For test size $\alpha = 0.05$, the $D_n$ test can detect with power 0.5 a $\Delta$ about
half the magnitude of that which the $X^2$ test can detect with this power; even with
$n = 200$, the ratio of $\Delta$'s is 0.6, and it declines steadily in favour of $D_n$ as $n$ increases.
For $\alpha = 0.01$ the relative performances are very similar. Since this comparison is
based on the poor lower bound (30.144) to the power of $D_n$, we must conclude that $D_n$ is
a very much more sensitive test for the fit of a continuous distribution.

Kac et al. (1955) point out that if the Mann-Wald equal-probabilities procedure
of 30.28-9 is used, the $X^2$ test requires $\Delta$ to be of order $n^{-2/5}$ to attain power $\tfrac{1}{2}$, whereas
$D_n$ requires $\Delta$ to be of order $n^{-1/2}$. Thus $D_n$ asymptotically requires sample size to be
of order $n^{4/5}$ compared to $n$ for the $X^2$ test, and is asymptotically very much more
efficient; in fact the relative efficiency of $X^2$ will tend to zero as $n$ increases.

A detailed review of the theory of the $W^2$, $D_n$ and related tests is given by Darling
(1957).


Computation of $D_n$

30.62 If we are setting confidence limits for the unknown $F(x)$, no computations
are required beyond the simple calculation of $S_n(x)$ and the setting of bounds distant
$\pm d_\alpha$ from it. In using $D_n$ for testing, however, we have to face the possibility of
calculating $F_0(x)$ for every observed value of $x$, a procedure which is tedious even when
$F_0(x)$ is well tabulated. However, because the test criterion is the maximum deviation
between $S_n(x)$ and $F_0(x)$, it is often possible by preliminary examination of the
data to locate the intervals in which the deviations are likely to be large. If initial
calculations are made only for these values, computations may be stopped as soon as
a single deviation exceeding $d_\alpha$ is found. (This abbreviation of the calculations is not
possible for statistics like $W^2$, which depend on all deviations.)

A further considerable saving of labour may be effected as in the following example,
due to Z. W. Birnbaum (1952).


Example 30.6 

A sample of 40 observations is to hand, where values are arranged in order:
0.0475, 0.2153, 0.2287, 0.2824, 0.3743, 0.3868, 0.4421, 0.5033, 0.5945, 0.6004, 0.6255,
0.6331, 0.6478, 0.7867, 0.8878, 0.8930, 0.9335, 0.9602, 1.0448, 1.0556, 1.0894, 1.0999,
1.1765, 1.2036, 1.2344, 1.2543, 1.2712, 1.3507, 1.3515, 1.3528, 1.3774, 1.4209, 1.4304,
1.5137, 1.5288, 1.5291, 1.5677, 1.7238, 1.7919, 1.8794.

We wish to test, with $\alpha = 0.05$, whether the parent $F_0(x)$ is normal with mean 1 and
variance 1/6. From Z. W. Birnbaum's (1952) tables we find for $n = 40$, $\alpha = 0.05$ that
$d_\alpha = 0.2101$. Consider the smallest observation, $x_{(1)}$. To be acceptable, $F_0(x_{(1)})$
should lie between 0 and $d_\alpha$, i.e. in the interval (0, 0.2101). The observed value of
$x_{(1)}$ is 0.0475, and from tables of the normal d.f. we find $F_0(x_{(1)}) = 0.0098$, within the
above interval, so the hypothesis is not rejected by this observation. Further, it cannot
possibly be rejected by the next higher observations until we reach an $x_{(i)}$ for which
either (a) $i/40 - 0.2101 > 0.0098$, i.e. $i > 8.796$, or (b) $F_0(x_{(i)}) > 0.2101 + 1/40$, i.e.
$x_{(i)} > 0.7052$ (from the tables again). The 1/40 is added on the right of (b) because
we know that $S_n(x_{(i)}) \geq 1/40$ for $i > 1$. Now from the data, $x_{(i)} > 0.7052$ for $i \geq 14$.
We next need, therefore, to examine $i = 9$ (from the inequality (a)). We find there
the acceptance interval for $F_0(x_{(9)})$

$(S_n(x_{(9)}) - d_\alpha,\ S_n(x_{(8)}) + d_\alpha) = (9/40 - 0.2101,\ 8/40 + 0.2101) = (0.0149,\ 0.4101)$.

We find from the tables $F_0(x_{(9)}) = F_0(0.5945) = 0.1603$, which is acceptable. To
reject $H_0$, we now require either

$i/40 - 0.2101 > 0.1603$, i.e. $i > 14.82$

or $F_0(x_{(i)}) > 0.4101 + 1/40$, i.e. $x_{(i)} > 0.9052$, i.e. $i \geq 17$.
We therefore proceed to $i = 15$, and so on. The reader should verify that only the
6 values $i$ = 1, 9, 15, 21, 27, 34 require computations in this case. The hypothesis is
accepted because in every one of these six cases the value of $F_0$ lies in the confidence
interval; it would have been rejected, and computations ceased, if any one value had
lain outside the interval.
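
The skipping rule of Example 30.6 can be sketched in code. This is an illustration under stated assumptions rather than Birnbaum's own procedure: the function name `birnbaum_scan` is hypothetical, the normal d.f. and its inverse come from Python's `statistics.NormalDist` instead of printed tables, and the hypothesised variance is taken as 1/6, the value consistent with the tabulated $F_0(0.0475) = 0.0098$.

```python
from bisect import bisect_right
from math import floor, inf, sqrt
from statistics import NormalDist

SAMPLE = [
    0.0475, 0.2153, 0.2287, 0.2824, 0.3743, 0.3868, 0.4421, 0.5033, 0.5945, 0.6004,
    0.6255, 0.6331, 0.6478, 0.7867, 0.8878, 0.8930, 0.9335, 0.9602, 1.0448, 1.0556,
    1.0894, 1.0999, 1.1765, 1.2036, 1.2344, 1.2543, 1.2712, 1.3507, 1.3515, 1.3528,
    1.3774, 1.4209, 1.4304, 1.5137, 1.5288, 1.5291, 1.5677, 1.7238, 1.7919, 1.8794,
]


def birnbaum_scan(xs, dist, d_alpha):
    """Test H0 with D_n, evaluating F0 only where rejection is possible.

    xs must be sorted in increasing order. Returns (accepted, checked),
    where checked lists the 1-based ranks i at which F0(x_(i)) was computed.
    """
    n = len(xs)
    checked = []
    i = 1
    while i <= n:
        checked.append(i)
        f0 = dist.cdf(xs[i - 1])
        # Acceptance interval at x_(i): (i/n - d, (i-1)/n + d).
        if not (i / n - d_alpha <= f0 <= (i - 1) / n + d_alpha):
            return False, checked
        # (a) smallest rank j whose lower limit j/n - d exceeds the current F0.
        j_a = floor(n * (f0 + d_alpha)) + 1
        # (b) smallest rank j with F0(x_(j)) > i/n + d, found by one inversion of F0.
        bound = i / n + d_alpha
        x_bound = dist.inv_cdf(bound) if bound < 1.0 else inf
        j_b = bisect_right(xs, x_bound) + 1
        i = max(i + 1, min(j_a, j_b))
    return True, checked


accepted, checked = birnbaum_scan(SAMPLE, NormalDist(1.0, sqrt(1.0 / 6.0)), 0.2101)
```

Between two checked ranks neither limit can be violated, because $F_0$ is non-decreasing and $S_n$ rises by $1/n$ at each observation, so the skipped observations need no evaluation of $F_0$.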


Tests of normality 

30.63 To conclude this chapter, we refer briefly to the problem of testing
normality, i.e. the problem of testing whether the parent d.f. is a member of the family of
normal distributions, the parameters being unspecified. Of course, any general test
of fit for the composite hypothesis may be employed to test normality, and to this
extent no new discussion is necessary. However, it is common to test the observed
moment ratios $b_1$ and $b_2$, or simple functions of them, against their distributions given
the hypothesis of normality (cf. 12.18 and Exercises 12.9-10) and these are sometimes
called “tests of normality.” This is a very loose description, and they are better called
tests of skewness and kurtosis respectively. See 32.24 below. Geary (e.g. 1947) has
developed and investigated an alternative test of kurtosis based on the ratio of sample
mean deviation to standard deviation which is tabulated in the Biometrika Tables, as
$\sqrt{b_1}$ and $b_2$ are.

Kac et al. (1955) discuss the distributions of $D_n$ and $W^2$ in testing normality when
the two parameters $(\mu, \sigma^2)$ are estimated from the sample by $(\bar{x}, s^2)$. The limiting
distributions are parameter-free (because these are location and scale parameters, cf.
30.36) but are not obtained explicitly. Some sampling experiments are reported which
give empirical estimates of these distributions.

Shapiro and Wilk (1965) give a new criterion for testing normality based on the
regression of the order-statistics upon their expected values, using the theory of 19.18-20
and extensive sampling experiments to establish its distribution.


EXERCISES


30.1 Show that if, in testing a composite hypothesis, an inconsistent set of estimators
$t$ is used, the statistic $X^2 \to \infty$ as $n \to \infty$.
(cf. Fisher, 1924c)


30.2 Using (30.33), show that the matrix $M$ defined at (30.37) reduces, when the
vector of multinomial ML estimators $\hat{\theta}$ is used, to


1s ae 
(8, 9) 


and that $M$ is idempotent with tr $M = s$. Hence confirm the result of 30.10 and 30.14
that $X^2$ is asymptotically distributed like $\chi^2_{k-s-1}$ when $\hat{\theta}$ is used.
(Watson, 1959)


30.3 Show from the limiting joint normality of the $n_i$ that as $n \to \infty$, the variance
of the simple-hypothesis $X^2$ statistic in the equal-probabilities case ($p_{0i} = 1/k$) is

$$\operatorname{var}(X^2) = 2k^2\left\{\sum_i p_{1i}^2 - \left(\sum_i p_{1i}^2\right)^2\right\} + 4(n-1)k^2\left\{\sum_i p_{1i}^3 - \left(\sum_i p_{1i}^2\right)^2\right\},$$

where $p_{1i}$, $i = 1, 2, \ldots, k$ are the true class-probabilities. Verify that this reduces to
the correct value $2(k-1)$ when

$$p_{1i} = p_{0i} = 1/k.$$
(Mann and Wald, 1942)


30.4 Establish the non-central $\chi^2$ result of 30.27 for the alternative hypothesis
distribution of the $X^2$ test statistic.
(cf. Cochran, 1952)


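30.5 Show from the moments of the multinomial distribution (cf. (5.80)) that the
exact variance of the simple-hypothesis $X^2$ statistic is given by

$$n \operatorname{var}(X^2) = 2(n-1)\left\{2(n-2)\sum_i \frac{p_{1i}^3}{p_{0i}^2} - (2n-3)\left(\sum_i \frac{p_{1i}^2}{p_{0i}}\right)^2 - 2\left(\sum_i \frac{p_{1i}^2}{p_{0i}}\right)\left(\sum_i \frac{p_{1i}}{p_{0i}}\right) + 3\sum_i \frac{p_{1i}^2}{p_{0i}^2}\right\} - \left(\sum_i \frac{p_{1i}}{p_{0i}}\right)^2 + \sum_i \frac{p_{1i}}{p_{0i}^2}.$$

(Patnaik, 1949)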


30.6 For the same alternative hypothesis as in Example 30.3, namely the Gamma
distribution with parameter 1.5, use the Biometrika Tables to obtain the $p_{1i}$ for the
unequal-probabilities four-class grouping in Example 30.2. Calculate the non-central
parameter (30.62) for this case, and show by comparison with Example 30.3 that the
unequal-probabilities grouping would require about a 25 per cent larger sample than the
equal-probabilities grouping in order to attain the same power against this alternative.


30.7 $k$ independent standardized normal variables $x_j$ are subject to $c$ homogeneous
linear constraints. Show that $S = \sum_{j=1}^{k} x_j^2$ is distributed independently of the signs of
the $x_j$. If $c = 1$, and the constraint is $\sum_{j=1}^{k} x_j = 0$, show that all sequences of signs are
equiprobable (except all signs positive, or all signs negative, which cannot occur), but
that this is not so generally for $c > 1$. Hence show that any test based on the sequence
of signs of the deviations of observed from hypothetical frequencies $(n_i - np_{0i})$ is
asymptotically independent of the $X^2$ test when $H_0$ holds.

(F. N. David, 1947; Seal, 1948; Fraser, 1950)


30.8 $M$ elements of one kind and $N$ of another are arranged in a sequence at random
($M, N > 0$). A run is defined as a subsequence of elements of one kind immediately
preceded and succeeded by elements of the other kind. Let $R$ be the number of runs
in the whole sequence ($2 \leq R \leq M+N$). Show that

$$P\{R = 2s\} = 2\binom{M-1}{s-1}\binom{N-1}{s-1}\bigg/\binom{M+N}{M},$$

$$P\{R = 2s-1\} = \left\{\binom{M-1}{s-2}\binom{N-1}{s-1} + \binom{M-1}{s-1}\binom{N-1}{s-2}\right\}\bigg/\binom{M+N}{M},$$

and that

$$E(R) = 1 + \frac{2MN}{M+N},$$

$$\operatorname{var} R = \frac{2MN(2MN-M-N)}{(M+N)^2(M+N-1)}.$$

(Stevens (1939); Wald and Wolfowitz (1940). Swed and Eisenhart
(1943) tabulate the distribution of $R$ for $M \leq N \leq 20$.)
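
As a numerical check on the formulae of Exercise 30.8 (not a substitute for the proof asked for), the distribution of $R$ can be tabulated in exact rational arithmetic and its mean and variance compared with the closed forms. The function names below are hypothetical.

```python
from fractions import Fraction
from math import comb


def runs_pmf(M, N):
    """Exact distribution of the number of runs R for M and N elements (M, N > 0)."""
    total = comb(M + N, M)
    pmf = {}
    for r in range(2, M + N + 1):
        if r % 2 == 0:  # R = 2s: s runs of each kind, either kind first
            s = r // 2
            ways = 2 * comb(M - 1, s - 1) * comb(N - 1, s - 1)
        else:  # R = 2s - 1: s runs of one kind and s - 1 of the other
            s = (r + 1) // 2
            ways = (comb(M - 1, s - 2) * comb(N - 1, s - 1)
                    + comb(M - 1, s - 1) * comb(N - 1, s - 2))
        pmf[r] = Fraction(ways, total)
    return pmf


def runs_mean_var(M, N):
    """Closed-form mean and variance of R."""
    mean = 1 + Fraction(2 * M * N, M + N)
    var = Fraction(2 * M * N * (2 * M * N - M - N), (M + N) ** 2 * (M + N - 1))
    return mean, var


pmf = runs_pmf(3, 2)
mean = sum(r * p for r, p in pmf.items())
var = sum(r * r * p for r, p in pmf.items()) - mean ** 2
```

For $M = 3$, $N = 2$ the ten arrangements give 2, 3, 4 and 1 sequences with $R$ = 2, 3, 4 and 5 respectively, so the mean is 17/5 and the variance 21/25, in agreement with the closed forms.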


30.9 From Exercises 30.7 and 30.8, show that if there are $M$ positive and $N$ negative
deviations $(n_i - np_{0i})$, we may use the runs test to supplement the $X^2$ test for the simple
hypothesis. From Exercise 16.4, show that if $P_1$ is the probability of a value of $X^2$ not
less than that observed and $P_2$ is the probability of a value of $R$ not greater than that
observed, then $U = -2(\log P_1 + \log P_2)$ is asymptotically distributed like $\chi^2$ with 4
degrees of freedom, large values of $U$ forming the critical region for the combined test.

(F. N. David, 1947)


30.10 $x_1, x_2, \ldots, x_n$ are independent random variables with the same distribution
$f(x \mid \theta_1, \theta_2)$. $\theta_1$ and $\theta_2$ are estimated by statistics $t_1(x_1, x_2, \ldots, x_n)$, $t_2(x_1, x_2, \ldots, x_n)$.
Show that the random variables

$$y_i = \int_{-\infty}^{x_i} f(u \mid t_1, t_2)\,du$$

are not independent and that they have a distribution depending in general on $f$, $\theta_1$ and
$\theta_2$; but that if $\theta_1$ and $\theta_2$ are respectively location and scale parameters and $t_1$, $t_2$ are
suitably invariant estimators, the distribution of $y_i$ is not dependent on $\theta_1$ and $\theta_2$, but on the
form of $f$ alone.

(F. N. David and Johnson, 1948)


30.11 Show that for testing a composite hypothesis the $X^2$ test statistic using
multinomial ML estimators is asymptotically equivalent to the LR test statistic when $H_0$ holds.


30.12 Show that Neyman’s goodness-of-fit statistic (30.80) is equivalent to the LR 
test of the simple hypothesis (30.78) in large samples. 
(E. S. Pearson, 1938) 


30.13 Verify the values of the mean and variance (30.81-2).


30.14 Prove formula (30.102) for the variance of $W^2$.


30.15 Verify that $nW^2$ may be expressed in the form (30.105).


30.16 In testing a simple hypothesis specifying a d.f. $F_0(x)$, show diagrammatically
that for a simple alternative $F_1(x)$ satisfying

$F_1(x) < F_0(x)$ when $F_0(x) < d_\alpha$,
$F_1(x) = F_0(x)$ elsewhere,

the $D_n$ test (with critical value $d_\alpha$) may be biassed.
(Massey, 1950b, 1952)


30.17 A random sample of $n$ observations $u_i$ is taken from the rectangular
distribution on the interval (0, 1), dividing that interval into $(n+1)$ lengths $c_j$, where $c_j \geq 0$
and $\sum_{j=1}^{n+1} c_j = 1$. The $c_j$ are ordered so that $c_{(1)} \leq c_{(2)} \leq \ldots \leq c_{(n+1)}$. Show that the
non-negative variables

$$g_1 = (n+1)c_{(1)},$$
$$g_j = (n+2-j)(c_{(j)} - c_{(j-1)}), \quad j = 2, \ldots, n+1; \qquad \sum_{j=1}^{n+1} g_j = 1,$$

have the distribution

$$dF = n!\,dg_1 \ldots dg_n,$$

and that the unordered $c_j$ also have this distribution, so that

$$dF = n!\,dc_1 \ldots dc_n$$

(the $(n+1)$th variable being omitted in each case to remove the singularity of the
distribution). Hence show that the variables

$$w_r = \sum_{j=1}^{r} g_j, \quad r = 1, 2, \ldots, n,$$

are distributed exactly as the order-statistics of the original sample, $u_{(r)}$, $r = 1, 2, \ldots, n$.
Thus any test of fit based on the probability-integral transformation may be applied
to the $w_r$ as well as to the $u_{(r)}$ obtained from the transformation.

(Durbin (1961), who finds from sampling experiments that a one-sided
Kolmogorov test applied to the $w_r$ has better power properties than the
ordinary two-sided $D_n$ test for detecting changes in distributional form)


30.18 Let $\theta$ be the unspecified parameters in testing a composite hypothesis of fit
for $n$ observations $x$. Suppose that $t$, with less than $n$ components, is minimal sufficient
for $\theta$, and that we can make a 1-1 transformation from $x$ to $(t, u)$, where $u$ is distributed
independently of $t$. Show that if the value of $t$ is discarded, and is replaced by a random
observation $t'$ from its distribution with a known value of $\theta$, then the set of observations
$x'$ obtained by the inverse transformation from $(t', u)$ is distributed independently of $\theta$,
so that the hypothesis of fit becomes simple.
(Durbin, 1961)


CHAPTER 31
ROBUST AND DISTRIBUTION-FREE PROCEDURES


31.1 In the course of our examination of the various aspects of statistical theory
which we have so far encountered, we have found on many occasions that excellent
progress can be made when the underlying parent populations are normal in form.
The basic reason for this is the spherical symmetry which characterizes normality, but
this is not our present concern. What we have now to discuss is the extent to which
we are likely to be justified if we apply this so-called “normal theory” in circumstances
where the underlying distributions are not in fact normal. For, in the light
of the relative abundance of theoretical results in the normal case, there is undoubtedly
a temptation to regard distributions as normal unless otherwise proven, and to use the
standard normal theory wherever possible. The question is whether such optimistic
assumptions of normality are likely to be seriously misleading.

We may formulate the problem more precisely for hypothesis-testing problems in
the manner of our discussion of similar regions in 23.4. There, it will be recalled,
we were concerned to establish the size of a test at a value $\alpha$, irrespective of the values
of some nuisance parameters. Our present question is of essentially the same kind,
but it relates to the form of the underlying distribution itself rather than to its
unspecified parameters: is the test size $\alpha$ sensitive to changes in the distributional form?

A statistical procedure which is insensitive to departures from the assumptions
which underlie it is called “robust,” an apt term introduced by Box (1953) and now
in general use. Studies of robustness have been carried out by many writers. A good
deal of their work has been concerned with the Analysis of Variance, and we postpone
discussion of this until Volume 3. At present, we confine ourselves to the results
relevant to the procedures we have already encountered. Box and Andersen (1955)
survey the subject generally.


The robustness of the standard “normal theory” procedures


31.2 Beginning with early experimental studies, notably by E. S. Pearson, the
examination of robustness was continued by means of theoretical investigations, among
which those of Bartlett (1935a), Geary (1936, 1947) and Gayen (1949-1951) are
essentially similar in form. The observations are taken to come from parent populations
specified by Gram-Charlier or Edgeworth series expansions, and corrective terms, to
be added to the normal theory, are obtained as functions of the standardized higher
cumulants, particularly $\kappa_3$ and $\kappa_4$. Their results may broadly be summarized by the
statement that whereas tests on population means (i.e. “Student's” $t$-tests for the
mean of a normal population and for the difference between the means of two normal
populations with the same variance) are rather insensitive to departures from normality,
tests on variances (i.e. the $\chi^2$ test for the variance of a normal population, the $F$-test
for the ratio of two normal population variances, and the modified LR test for the
equality of several normal variances in Examples 24.4, 24.6) are very sensitive to such
departures. Tests on means are robust; by comparison, tests on variances can only
be described as frail. We have not the space here for a detailed derivation of these
results, but it is easy to explain them in general terms.


31.3 The crucial point in the derivation of “Student's” $t$-distribution is the
independence of its numerator and denominator, which holds exactly only for normal
parent populations. If we are sampling from non-normal populations, the Central
Limit theorem nevertheless assures us that the sample mean and the unbiassed variance
estimator $s^2 = k_2$ will be asymptotically normally distributed. What is more, we know
from Rule 10 for the sampling cumulants of $k$-statistics in 12.14 that

$$\kappa(2\,1) = \kappa_3/n, \tag{31.1}$$
$$\kappa(2^r 1^s) = O(n^{-(r+s-1)}). \tag{31.2}$$

Since

$$\kappa(1^2) = \kappa_2/n, \qquad \kappa(2^2) = \frac{\kappa_4}{n} + \frac{2\kappa_2^2}{n-1},$$

we have from (31.1) for the asymptotic correlation between $\bar{x}$ and $s^2$

$$\rho = \kappa_3/\{\kappa_2(\kappa_4 + 2\kappa_2^2)\}^{1/2}. \tag{31.3}$$

If the non-normal population is symmetrical, $\kappa_3$ and $\rho$ of (31.3) are zero, and
hence $\bar{x}$ and $s^2$ are asymptotically independent, so that the normal theory will hold
for $n$ large enough. If $\kappa_3 \neq 0$, (31.3) will be smaller when $\kappa_4$ is large, but will remain
non-zero. The situation is saved, however, by the fact that the exact “Student”
$t$-distribution itself approaches normality as $n \to \infty$, as also, by the Central Limit
theorem, does the distribution of

$$t = (\bar{x} - \mu)/(s^2/n)^{1/2}, \tag{31.4}$$

since $s^2$ converges stochastically to $\sigma^2$. The two limiting distributions are the same.
Thus, whatever the parent distribution, the statistic (31.4) tends to normality, and
hence to the limiting normal theory. If the parent is symmetrical we may expect the
statistic to approach its normal theory distribution (“Student's”) more rapidly. This is,
in fact, what the detailed investigations have confirmed: for small samples the normal
theory is less robust in the face of parent skewness than for departure from mesokurtosis.
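
To give a feeling for the size of the asymptotic correlation (31.3), it may be evaluated from the cumulants of a few parent distributions. The sketch below is illustrative only; the function name is hypothetical, and the cumulants quoted in the comments (exponential with unit mean, and rectangular on (0, 1)) are standard values rather than results taken from this chapter.

```python
from math import sqrt


def mean_variance_correlation(k2, k3, k4):
    """Asymptotic correlation (31.3) between the sample mean and s^2.

    k2, k3, k4 are the second, third and fourth cumulants of the parent.
    """
    return k3 / sqrt(k2 * (k4 + 2.0 * k2 ** 2))


# Exponential parent with unit mean: kappa_r = (r - 1)!, a markedly skew case.
rho_exponential = mean_variance_correlation(1.0, 2.0, 6.0)

# Rectangular parent on (0, 1): kappa_2 = 1/12, kappa_3 = 0, kappa_4 = -1/120.
rho_rectangular = mean_variance_correlation(1.0 / 12.0, 0.0, -1.0 / 120.0)
```

For the exponential parent the correlation is $2/\sqrt{8} \approx 0.71$, so numerator and denominator of $t$ are far from independent, while for any symmetrical parent it vanishes.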


31.4 Similarly for the two-sample “Student's” $t$-statistic. If the two samples
come from the same non-normal population and we use the normal test statistic

$$t = (\bar{x}_1 - \bar{x}_2)\bigg/\left\{\left[\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}\right]\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\right\}^{1/2}, \tag{31.5}$$

we find that the covariance between $(\bar{x}_1 - \bar{x}_2)$ and the term in square brackets in the
denominator, say $s^2$, is given by

$$\operatorname{cov} = \kappa_3\left\{\frac{n_1-1}{n_1+n_2-2}\cdot\frac{1}{n_1} - \frac{n_2-1}{n_1+n_2-2}\cdot\frac{1}{n_2}\right\} = \frac{\kappa_3}{n_1+n_2-2}\left(\frac{1}{n_2} - \frac{1}{n_1}\right),$$

while the variances corresponding to this are

$$\operatorname{var}(\bar{x}_1 - \bar{x}_2) = \kappa_2\left(\frac{1}{n_1} + \frac{1}{n_2}\right),$$

$$\operatorname{var}(s^2) \sim (\kappa_4 + 2\kappa_2^2)/(n_1+n_2).$$

The correlation is therefore asymptotically

$$\rho = \frac{\kappa_3}{\{\kappa_2(\kappa_4 + 2\kappa_2^2)\}^{1/2}}\cdot\frac{(n_1 n_2)^{1/2}}{n_1+n_2-2}\left(\frac{1}{n_2} - \frac{1}{n_1}\right). \tag{31.6}$$

Again, if $\kappa_3 = 0$, the asymptotic normality carries asymptotic independence with it.
We also see that $\rho$ is zero if $n_1 = n_2$. In any case, as $n_1$ and $n_2$ become large, the
Central Limit theorem brings (31.5) to asymptotic normality and hence to agreement
with the “Student's” $t$-distribution.

Once again, these are precisely the results found by Bartlett (1935a) and Gayen
(1949-1951): if sample sizes are equal, even skewness in the parent is of little effect
in disturbing normal theory. If the parent is symmetrical, the test will be robust
even for differing sample sizes.


31.5 Studies have also been made of the effects of more complicated departures
from normality in “Student's” $t$-tests. Hyrenius (1950) considered sampling from a
compound normal distribution, and other Swedish writers, the most recent of whom
is Zackrisson (1959) who gives references to earlier work, have considered various forms
of populations composed of normal sub-populations. Robbins (1948) obtains the
distribution of $t$ when the observations come from normal populations differing only in means.
For the two-sample test, Geary (1947) and Gayen (1949-1951) permit the samples to
emanate from different populations.


31.6 When we turn to tests on variances, the picture is very different. The
crucial point for normal theory in all tests on variances is that the ratio
$z = \sum_{i=1}^{n} (x_i - \bar{x})^2/\sigma^2$
is distributed like $\chi^2$ with $(n-1)$ degrees of freedom. If we consider the sampling
cumulants of $k_2 = \kappa_2 z/(n-1)$, we see from (12.35) that

$$\operatorname{var} z = \left(\frac{n-1}{\kappa_2}\right)^2\left\{\frac{\kappa_4}{n} + \frac{2\kappa_2^2}{n-1}\right\} = (n-1)\left\{2 + \frac{n-1}{n}\,\frac{\kappa_4}{\kappa_2^2}\right\}, \tag{31.7}$$

while from (12.36)

$$\kappa_3(z) = \left(\frac{n-1}{\kappa_2}\right)^3\left\{\frac{\kappa_6}{n^2} + \frac{12\kappa_4\kappa_2}{n(n-1)} + \frac{4(n-2)\kappa_3^2}{n(n-1)^2} + \frac{8\kappa_2^3}{(n-1)^2}\right\} \sim (n-1)\left\{\frac{\kappa_6}{\kappa_2^3} + \frac{12\kappa_4}{\kappa_2^2} + \frac{4\kappa_3^2}{\kappa_2^3} + 8\right\},$$

and similarly for higher moments from (12.37-39). These expressions make it obvious
that the distribution of $z$ depends on all the (standardized) cumulant ratios $\kappa_3/\kappa_2^{3/2}$,
$\kappa_4/\kappa_2^2$, etc., and that the terms involving these ratios are of the same order in $n$ as the
normal theory constant terms. If, and only if, all higher cumulants are zero, so that
the parent distribution is normal, these additional terms will disappear. Otherwise,
(31.7) shows that even though $z$ is asymptotically normally distributed, the large-sample
distribution of $z$ will not approach the normal theory $\chi^2$ distribution. The
Central Limit theorem does not rescue us here because $z$ tends to a different normal
distribution from the one we want.


31.7 Because $\kappa_4$ appears in (31.7) but $\kappa_3$ does not, we should expect deviations
from mesokurtosis to exercise the greater effect on the distribution, and this is
precisely the result found after detailed calculations by Gayen (1949-1951) for the $\chi^2$
and variance-ratio tests for variances. Box (1953) found that the discrepancies from
asymptotic normal theory became larger as more variances were compared, and his
argument is simple enough to reproduce here.

Suppose that $k$ samples of sizes $n_i$ ($i = 1, 2, \ldots, k$) are drawn from populations
each of which has the same variance $\kappa_2$ and the same kurtosis coefficient $\gamma_2 = \kappa_4/\kappa_2^2$.
From (31.7), we then have asymptotically for any one sample

$$\operatorname{var}(s_i^2) = 2\kappa_2^2(1 + \tfrac{1}{2}\gamma_2)/n_i, \tag{31.8}$$

where $s_i^2$ is the unbiassed estimator of $\kappa_2$. Now by the Central Limit theorem, $s_i^2$ is
asymptotically normal with mean $\kappa_2$ and variance (31.8), and is therefore distributed
as if it came from a normal population and were based on $N_i = n_i/(1 + \tfrac{1}{2}\gamma_2)$ observations
instead of $n_i$. Thus the effect on the modified LR criterion for comparing $k$
normal variances, given at (24.44), is that $-2\log l^*/(1 + \tfrac{1}{2}\gamma_2)$ and not $-2\log l^*$ itself
is distributed asymptotically as $\chi^2$ with $k-1$ degrees of freedom.

The effects of this correction on the normal theory distribution can be quite extreme.
We give in the table below some of Box's (1953) computations:


True probability of exceeding the asymptotic normal theory
critical value for $\alpha = 0.05$

  gamma_2 |  k = 2      3        5        10       30
  --------+----------------------------------------------
    -1    |  0.0056   0.0025   0.0008   0.0001   0.0°1
     0    |  0.05     0.05     0.05     0.05     0.05
     1    |  0.110    0.136    0.176    0.257    0.498
     2    |  0.166    0.224    0.315    0.489    0.849


As is obvious from the table, the discrepancy from the normal theory value of
0.05 increases with $|\gamma_2|$, and with $k$ for any fixed $\gamma_2 \neq 0$.
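
The entries of the table follow directly from the argument of 31.7: if $c$ is the normal theory 5 per cent point of $\chi^2$ with $k-1$ degrees of freedom, the true probability of exceeding it is $P\{\chi^2_{k-1} > c/(1 + \tfrac{1}{2}\gamma_2)\}$. The sketch below recomputes them. It is an illustration, not Box's own computation: the function names are hypothetical, the critical values are the usual tabulated 5 per cent points, and the $\chi^2$ tail probability is obtained from the series for the incomplete gamma function.

```python
import math


def chi2_sf(x, df):
    """P(chi-square with df degrees of freedom > x), by the incomplete gamma series."""
    a, h = df / 2.0, x / 2.0
    if h <= 0.0:
        return 1.0
    # P(a, h) = h^a e^{-h} / Gamma(a) * sum_{m >= 0} h^m / (a (a+1) ... (a+m))
    term = 1.0 / a
    total = term
    m = 0
    while term > 1e-16 * total:
        m += 1
        term *= h / (a + m)
        total += term
    lower = total * math.exp(-h + a * math.log(h) - math.lgamma(a))
    return 1.0 - lower


# Tabulated 5 per cent points of chi-square with k - 1 degrees of freedom.
CRIT_05 = {2: 3.841, 3: 5.991, 5: 9.488, 10: 16.919, 30: 42.557}


def true_size(gamma2, k):
    """True asymptotic size of the nominal 5 per cent test of k equal variances."""
    return chi2_sf(CRIT_05[k] / (1.0 + 0.5 * gamma2), k - 1)


table = {g: [true_size(g, k) for k in (2, 3, 5, 10, 30)] for g in (-1, 0, 1, 2)}
```

Rounded to three decimal places, `true_size(2, 2)` gives 0.166 and `true_size(1, 3)` gives 0.136, as in the table.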


31.8 Although the result of 31.7 is asymptotic, Box (1953) shows that similar
discrepancies occur for small samples. The lack of robustness in the variance test is so
striking, indeed, that he was led to consider the criterion $l^*$ of (24.44) as a test statistic
for kurtosis, and found its sensitivity to be of the same order as the generally-used
tests mentioned in 30.63.


31.9 Finally, we mention briefly that Gayen (1949-1951) has considered the
robustness both of the sample correlation coefficient $r$, and of Fisher's $z$-transformation of $r$,
to departures from bivariate normality. When the population correlation coefficient $\rho = 0$,
and in particular when the variables are independent, the distribution of $r$ is robust,
even for sample size as low as 11; but for large values of $\rho$ the departures from normal
theory are appreciable. The $z$-transformation remains asymptotically normally
distributed under parental non-normality, but the approach is less rapid. The mean and
variance of $z$ are, to order $n^{-1}$, unaffected by skewness in the parental marginal
distributions, but the effect of departures from mesokurtosis may be considerable; the variance
of $z$, in particular, is sensitive to the parental form, even in large samples, although the
mean of $z$ slowly approaches its normal value as $n$ increases.

Hotelling (1961) makes a quite different approach to problems of robustness and gives
a useful list of references.

Huber (1964) makes a general investigation of the robustness of estimators of a
location parameter; cf. also Bickel (1965) and Gastwirth (1966).


Transformations to normality 


31.10 The investigation of robustness has as its aim the recognition of the range
of validity of the standard normal theory procedures. As we have seen, this range
may be wide or extremely narrow, but it is often difficult in practice to decide whether
the standard procedures are likely to be approximately valid or misleading. Two
other approaches to the non-fulfilment of normality assumptions have been made,
which we now discuss.

The first possibility is to seek a transformation which will bring the observations
close to the normal form, so that normal theory may be applied to the transformed
observations. This may take the form discussed in 6.25-26, where we normalize by
finding a polynomial transformation. Alternatively, we may be able to find a simple
normalizing functional transformation (cf. 6.27-35 and Fisher's $z$-transformation of the
correlation coefficient at (16.75)). The difficulty in both cases is that we must have
knowledge of the underlying distribution before we know which transformation is best
applied, information which is likely to be obtainable in theoretical contexts like the
investigation of the sampling distribution of a statistic, but is harder to come by when
the distribution of interest is arising in experimental work.

Fortunately, transformations designed to stabilize a variance (i.e. to render it
independent of some parameter of the population) often also serve to normalize the
distribution to which they are applied; Fisher's $z$-transformation of $r$ is an example of this.
Exercise 16.18 shows how a knowledge of the relation between mean and variance in
the underlying distribution permits a simple variance-stabilizing transformation to be
carried out. Such transformations are most commonly used in the Analysis of Variance,
and we postpone detailed discussion of them until we treat that subject in Volume 3.


Distribution-free procedures 


31.11 The second of the alternative approaches mentioned at the beginning of
31.10 is a radical one. Instead of holding to the standard normal theory methods
(either because they are robust and approximately valid in non-normal cases or by
transforming the observations to make them approximately valid), we abandon them entirely
for the moment and approach our problems afresh. Can we find statistical procedures
which remain valid for a wide class of parent distributions, say for all continuous
distributions? If we can, they will necessarily be valid for normal distributions, and
our robustness will be precise and assured. Such procedures are called distribution-free,
as we have already seen in 30.35, because their validity does not depend on the
form of the underlying distributions at all, provided that they are continuous.

The remainder of this chapter, and parts of the two immediately following chapters,
will be devoted to distribution-free methods. First, we discuss the relationship of
distribution-free methods to the parametric-non-parametric distinction which we made
in 22.3.


31.12 It is clear that if we are dealing with a parametric problem (e.g. testing a
parametric hypothesis or estimating a parameter) the method we use may or may not
be distribution-free. It is perhaps not at once so clear that even if the problem is
non-parametric, the method also may or may not be distribution-free. For example,
in Chapter 30 we discussed composite tests of fit, where the problem is non-parametric,
and found that the test statistic is not even asymptotically distribution-free in general
when the estimators are not the multinomial ML estimators. Again, if we use the
sample moment-ratio $b_2 = m_4/m_2^2$ as a test of normality, the problem is non-parametric
but the distribution of $b_2$ is heavily dependent on the form of the parent.

However, most distribution-free procedures were devised for non-parametric
problems, such as testing whether two continuous distributions are identical, and there is
therefore a fairly free interchangeability of meanings in the terms “non-parametric”
and “distribution-free” as used in the literature. We shall always use them in the
quite distinct senses which we have defined: “non-parametric” is a description of
the problem and “distribution-free” of the method used to solve the problem.


Distribution-free methods for non-parametric problems 
31.13 The main classes of non-parametric problems which can be solved by 
distribution-free methods are as follows: 


(1) The two-sample problem 
The hypothesis to be tested is that two populations, from each of which 
we have a random sample of observations, are identical. 
(2) The k-sample problem 
This is the generalization of (1) to k > 2 populations. 
(3) Randomness 
A series of n observations on a single variable is ordered in some way 
(usually through time). The hypothesis to be tested is that each observation 
comes independently from the same distribution. 
(4) Independence in a bivariate population 
The hypothesis to be tested is that a bivariate distribution factorizes into 
two independent marginal distributions. 


These are all hypothesis-testing problems, and it is indeed the case that most
distribution-free methods are concerned with testing rather than estimation. However, we
can find distribution-free

(1a) Confidence intervals for a difference in location between two otherwise identical
continuous distributions,
(5) Confidence intervals and tests for quantiles,
and (6) Tolerance intervals for a continuous distribution.

In Chapter 30, we have already discussed

(7) Distribution-free tests of fit
and (8) Confidence intervals for a continuous distribution function.


The categories listed above contain the bulk of the work done on distribution-free 
methods so far, although they are not exhaustive, as we shall see. 
A very full bibliography of the subject is given by Savage (1962). 


31.14 The reader will probably have noticed that problems (1) to (3) in 31.13
are all of the same kind, being concerned with testing the identity of a number of
univariate continuous distributions, and he may have wondered why problem (4) has
been grouped with them. The reason is that problem (4) can be modified to give
problems (1) to (3). We shall indicate the relationship here briefly, and leave the
details until we come to particular tests later.

Suppose that in problem (3) we numerically label the ordering of the variable $x$ and
regard this labelling as the observations on a variable $y$. Problem (3) is then reduced
to testing the independence of $x$ and the label variable $y$, i.e. to a special case of
problem (4). Again in problem (4), suppose that the range of the second variable, say $z$,
is dichotomized, and that we score $y = 1$ or 2 according to which part of the dichotomy
an observed $z$ falls into. If we now test the independence of $x$ and $y$, we have reduced
problem (4) to problem (1), for if $x$ is independent of the $y$-classification, the
distributions of $x$ for $y = 1$ and for $y = 2$ must be identical. Similarly, we reduce problem
(4) to problem (2) by polytomizing the range of $z$ into $k > 2$ classes, scoring $y = 1, 2,
\ldots, k$, and testing the independence of $x$ and $y$.


The construction of distribution-free tests 


31.15 How can distribution-free tests be constructed for non-parametric problems?
We have already encountered two methods in our discussion of tests of fit in
Chapter 30: one was to use the probability integral transformation, which for simple
hypotheses yields a distribution-free test; the second was to reduce the problem to
a multinomial distribution problem, as for the $X^2$ test. We shall see in the next chapter
that this latter device in its simplest form serves to produce a test (the so-called Sign
Test) for problem (5) of 31.13. But important classes of distribution-free tests for
problems (1) to (4) rest on a different foundation, which we now examine.

If we know nothing of the form of the parent distributions, save perhaps that they 
are continuous, we obviously cannot find similar regions in the sample space by the 
methods used for parametric problems in Chapter 23. However, progress can be 
made. First, we make the necessary slight adjustments in our definitions of sufficiency 
and completeness. 

In the absence of a parametric formulation, we must make these definitions refer
directly to the parent d.f.; whereas previously we called a statistic t sufficient for the
parameter θ if the factorization (17.68) were possible, we now define a family C of
distributions and let θ be simply a variable indexing the membership of that family.
With this understanding, t is called sufficient for the family C if the factorization (17.68)
holds for all θ. Similarly, the definitions of completeness and bounded completeness
of a family of distributions in 23.9 hold good for non-parametric situations if θ is taken
as an indexing variable for members of the family.


31.16 Now we have seen in Examples 23.5 and 23.6 that the set of order-statistics
$t = (x_{(1)}, x_{(2)}, \ldots, x_{(n)})$ is a sufficient statistic in some parametric problems, though
not necessarily a minimal sufficient statistic. It is intuitively obvious that t will always
be a sufficient statistic when all the observations come from the same parent distribution,
for then no information is lost by ordering the observations. (It is also obvious
that it will be minimal sufficient if nothing at all is known about the form of the parent
distribution.) Now if the parent is continuous, we have observed in 23.5 that similar
regions can always be constructed by permutation of the co-ordinates of the sample
space, for tests of size which is a multiple of $(n!)^{-1}$. Such permutation leaves the set
of order-statistics constant. If nothing whatever is known of the form of the parent,
it is clear that we cannot get similar regions in any other way. Thus the result of
23.19 implies that the set of order-statistics is boundedly complete for the family of
all continuous d.f.s.(*)

We therefore see that if we wish to construct similar tests for hypotheses like those
of problems (1)-(4) of 31.13, we must use permutation tests which rest essentially on
the fact, proved in 11.4 and obvious by symmetry, that any ordering of a sample from
a continuous d.f. has the same probability $(n!)^{-1}$. There still remains the question of
which permutation test to use for a particular hypothesis.


The efficiency of distribution-free tests 

31.17 The search for distribution-free procedures is motivated by the desire to 
broaden the range of validity of our inferences. We cannot expect to make great 
gains in generality without some loss of efficiency in particular circumstances ; that is 
to say, we cannot expect a distribution-free test, chosen in ignorance of the form of 
the parent distribution, to be as efficient as the test we would have used had we known 
that parental form. But to use this as an argument against distribution-free procedures 
is manifestly mistaken: it is precisely the absence of information as to parental form 
which leads us to choose a distribution-free method. The only "fair" standard of
efficiency for a distribution-free test is that provided by other distribution-free tests. 
We should naturally choose the most efficient such test available. 

But in what sense are we to judge efficiency ? Even in the parametric case, UMP 
tests are rare, and we cannot hope to find distribution-free tests which are most power- 
ful against all possible alternatives. We are thus led to examine the power of distri- 
bution-free tests against parametric alternatives to the non-parametric hypothesis 
tested. Despite its paradoxical sound, there is nothing contradictory about this, and 
the procedure has one great practical virtue. If we examine power against the alter- 
natives considered in normal distribution theory, we obtain a measure of how much we 
can lose by using a distribution-free test if the assumptions of normal theory really 
are valid (though, of course, we would not know this in practice). If this loss is small, 
we are encouraged to sacrifice the little extra efficiency of the standard normal theory
methods for the extended range of validity attached to the use of the distribution-free
test.

(*) That it is actually complete is proved directly, e.g. by Lehmann (1959); the result is due
to Scheffé (1943b).

We may take this comparison of normal theory tests with distribution-free tests a 
stage further. In certain cases, it is possible to examine the relative efficiency of the 
two methods for a wide range of underlying parent distributions ; and it should be 
particularly noted that we have no reason to expect the normal theory method to main- 
tain its efficiency advantages over the distribution-free method when the parent dis- 
tribution is not truly normal. In fact, we might hazard a guess that distribution-free 
methods should suffer less from the falsity of the normality assumption than do the 
normal theory methods which depend upon that assumption. Such few investiga- 
tions as have been carried out seem on the whole to support this guess. 


Tests of independence 


31.18 We begin our detailed discussion of distribution-free tests for non-parametric 
hypotheses, which will illustrate the general points made in 31.15-17, with problem (4) 
of 31.13—the problem of independence. 

Suppose that we have a sample of n pairs (x, y) from a continuous bivariate
distribution function F(x, y) with continuous marginal distribution functions G(x), H(y).
We wish to test

$$H_0 : F(x, y) = G(x)\,H(y), \quad \text{all } x, y. \tag{31.9}$$

Under $H_0$, every one of the n! possible orderings of the x-values is equiprobable, and
independently of x, so is every one of the n! y-orderings; we therefore have $(n!)^2$
equiprobable points in the sample space. Since, however, we are interested only in the
relationship between x and y, we are concerned only with different pairings of the n x's
with the n y's, and there are n! distinct sets of pairings (obtained, e.g. by keeping the y's
fixed and permuting the x's) with equal probabilities $(n!)^{-1}$. From 31.16, all similar
size-α tests of $H_0$ contain αn! = N of these pairings (N assumed a positive integer).

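Each of the n! sets of pairings contains n values of (x, y) (some, of course, may
coincide). The question is now: what function of the values (x, y) shall we take as
our test statistic? Consider the alternative hypothesis $H_1$ that x and y are bivariate
normally distributed with non-zero correlation parameter ρ. We may then write the
Likelihood Function, by (16.47) and (16.50), as

$$L(x \mid H_1) = \{2\pi\sigma_x\sigma_y(1-\rho^2)^{1/2}\}^{-n} \exp\left[-\frac{n}{2(1-\rho^2)}\left\{\left(\frac{\bar{x}-\mu_x}{\sigma_x}\right)^2 - 2\rho\left(\frac{\bar{x}-\mu_x}{\sigma_x}\right)\left(\frac{\bar{y}-\mu_y}{\sigma_y}\right) + \left(\frac{\bar{y}-\mu_y}{\sigma_y}\right)^2 + \frac{s_x^2}{\sigma_x^2} - \frac{2\rho\, r s_x s_y}{\sigma_x\sigma_y} + \frac{s_y^2}{\sigma_y^2}\right\}\right]. \tag{31.10}$$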


Now changes in the pairings of the x's and y's leave the observed means and variances
$\bar{x}$, $\bar{y}$, $s_x^2$, $s_y^2$ unchanged. The sample correlation coefficient r, however, is affected
by the pairings through the term $\sum_{i=1}^{n} x_i y_i$ in its numerator. Evidently, (31.10) will be
largest for any ρ > 0 when r is as large as possible, and for any ρ < 0 when r is as small
as possible. By the Neyman-Pearson lemma of 22.10, we shall obtain the most powerful
permutation test by choosing as our critical regions those sets of pairings which
maximize (31.10), for when $H_0$ holds, all pairings are equiprobable. Thus consideration
of normal alternatives leads to the following test, first proposed on intuitive grounds
by Pitman (1937b): reject $H_0$ against alternatives of positive correlation if r is large,
against alternatives of negative correlation if r is small, and against general alternatives
of non-independence if |r| is large. The critical value in each case is to be determined
from the distribution of r over the n! distinct sets of pairings equiprobable
on $H_0$.

Although Pitman’s correlation test gives the most powerful permutation test of 
independence against normal alternatives, it is, of course, a valid test (i.e. it is a strictly 
size-α test) against any alternatives, and one may suppose that it will be reasonably
powerful for a wide range of alternatives approximating normality. 
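
For small n the test just described can be carried out by direct enumeration. The following is a minimal sketch added for illustration, not a procedure given in the text: the function names and the data are hypothetical, and the one-sided P-value is taken to be the proportion of the n! pairings whose r is at least as large as the observed value.

```python
from itertools import permutations
from math import sqrt

def corr(x, y):
    """Ordinary product-moment correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def pitman_p_value(x, y):
    """Exact one-sided P(r >= observed r): y held fixed, x permuted over all n! pairings."""
    r_obs = corr(x, y)
    all_r = [corr(p, y) for p in permutations(x)]
    # small tolerance so that pairings tied with the observed one are counted
    hits = sum(1 for r in all_r if r >= r_obs - 1e-12)
    return r_obs, hits / len(all_r)

# Hypothetical data, chosen only for illustration.
x = [1.2, 2.3, 2.9, 4.1, 5.0, 6.4]
y = [0.8, 1.9, 3.5, 3.1, 5.2, 5.9]
r_obs, p = pitman_p_value(x, y)
print(r_obs, p)
```

Since the attainable test sizes are multiples of 1/n!, the smallest possible P-value here is 1/720. Enumeration of this kind is practical only for small n.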


The permutation distribution of r 


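31.19 Since

$$r = \left(\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\bar{y}\right)\Big/ (s_x s_y), \tag{31.11}$$

only $\sum x_i y_i$ is a random variable under permutation. We can obtain its exact
distribution, and hence that of r, by enumeration of the n! possibilities, but this becomes
too tedious in practice when n is at all large. Instead, we approximate the exact
distribution by fitting a distribution to its moments. We keep the y's fixed and permute
the x's, and find

$$E\left(\sum_i x_i y_i\right) = \sum_i y_i E(x_i) = n\bar{x}\bar{y},$$

whence, from (31.11),

$$E(r) = 0. \tag{31.12}$$

For convenience, we now measure from the means $(\bar{x}, \bar{y})$. We have

$$\begin{aligned}
\operatorname{var}\left(\sum_i x_i y_i\right) &= \sum_i y_i^2 \operatorname{var} x_i + \sum_{i \neq j} y_i y_j \operatorname{cov}(x_i, x_j) \\
&= s_x^2 \sum_i y_i^2 + \left\{\left(\sum_i y_i\right)^2 - \sum_i y_i^2\right\} \frac{1}{n(n-1)}\left\{\left(\sum_i x_i\right)^2 - \sum_i x_i^2\right\} \\
&= n s_x^2 s_y^2 + n s_x^2 s_y^2/(n-1) \\
&= n^2 s_x^2 s_y^2/(n-1).
\end{aligned}$$

Thus (31.11) gives

$$\operatorname{var} r = (n^2 s_x^2 s_y^2)^{-1} \operatorname{var}\left(\sum_i x_i y_i\right) = 1/(n-1). \tag{31.13}$$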
The first two moments of r, given by (31.12) and (31.13), are quite independent
of the actual values of (x, y) observed. By similar expectational methods, it will be
found that

$$\begin{aligned}
E(r^3) &= \frac{n-2}{n(n-1)^2}\left(\frac{k_3}{k_2^{3/2}}\right)\left(\frac{k_3'}{k_2'^{3/2}}\right), \\
E(r^4) &= \frac{3}{n^2-1}\left\{1 + \frac{(n-2)(n-3)}{3n(n-1)^2}\left(\frac{k_4}{k_2^2}\right)\left(\frac{k_4'}{k_2'^2}\right)\right\},
\end{aligned} \tag{31.14}$$

where the k's are the k-statistics of the observed x's and the k′'s the k-statistics of the
y's. Neglecting the differences between k-statistics and sample cumulants, we may
rewrite (31.14) as

$$\begin{aligned}
E(r^3) &= \frac{n-2}{n(n-1)^2}\, g_1 g_1', \\
E(r^4) &= \frac{3}{n^2-1}\left\{1 + \frac{(n-2)(n-3)}{3n(n-1)^2}\, g_2 g_2'\right\},
\end{aligned} \tag{31.15}$$

where $g_1$, $g_2$ are the measures of skewness and kurtosis of the x's, and $g_1'$, $g_2'$ those of
the y's. If these are fixed, (31.14) may be written

$$\begin{aligned}
E(r^3) &= O(n^{-2}), \\
E(r^4) &= \frac{3}{n^2-1}\{1 + O(n^{-1})\}.
\end{aligned} \tag{31.16}$$

Thus, as n → ∞, we have approximately

$$E(r^3) = 0, \qquad E(r^4) = \frac{3}{n^2-1}. \tag{31.17}$$


The moments (31.12), (31.13) and (31.17) are precisely those of (16.62), the symmetrical
exact distribution of r in samples from a bivariate normal distribution with
ρ = 0, as may easily be verified by integration of $r^2$ and $r^4$ in (16.62). Thus, to a
close approximation, the permutation distribution of r is also

$$dF = \frac{1}{B\{\tfrac12(n-2), \tfrac12\}}\,(1-r^2)^{\frac12(n-4)}\,dr, \quad -1 \le r \le 1, \tag{31.18}$$

and we may therefore use (31.18), or equivalently the fact that $t = \{(n-2)r^2/(1-r^2)\}^{1/2}$
has a "Student's" distribution with (n−2) degrees of freedom, to carry out our tests
on r. (31.18) is in fact very accurate even for small n, as we might guess from the
exact agreement of its first two moments with those of the permutation distribution.


The convergence of the permutation and normal-theory distributions to a common 
limiting normal distribution has been rigorously proved by Hoeffding (1952). 
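
The moment results of this section can be checked numerically for a small sample by enumerating all n! pairings. The sketch below is an illustration under stated assumptions: the data are hypothetical, the k-statistics are computed from the usual expressions in terms of sample central moments, and the third and fourth moment formulae coded in `formula_moments` are our reading of (31.14).

```python
from itertools import permutations
from math import sqrt

def central_moments(v):
    """Sample central moments m2, m3, m4 (divisor n)."""
    n = len(v)
    m = sum(v) / n
    return [sum((a - m) ** k for a in v) / n for k in (2, 3, 4)]

def k_ratios(v):
    """Return (k3 / k2**1.5, k4 / k2**2) using the usual k-statistics."""
    n = len(v)
    m2, m3, m4 = central_moments(v)
    k2 = n * m2 / (n - 1)
    k3 = n * n * m3 / ((n - 1) * (n - 2))
    k4 = n * n * ((n + 1) * m4 - 3 * (n - 1) * m2 * m2) / ((n - 1) * (n - 2) * (n - 3))
    return k3 / k2 ** 1.5, k4 / k2 ** 2

def permutation_moments(x, y):
    """E(r), E(r^2), E(r^3), E(r^4) over all n! pairings (y fixed, x permuted)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx, sy = sqrt(central_moments(x)[0]), sqrt(central_moments(y)[0])
    all_r = [(sum(a * b for a, b in zip(p, y)) / n - mx * my) / (sx * sy)
             for p in permutations(x)]
    return [sum(r ** k for r in all_r) / len(all_r) for k in (1, 2, 3, 4)]

def formula_moments(x, y):
    """The same four moments from (31.12), (31.13) and our reading of (31.14)."""
    n = len(x)
    g1x, g2x = k_ratios(x)
    g1y, g2y = k_ratios(y)
    e3 = (n - 2) / (n * (n - 1) ** 2) * g1x * g1y
    e4 = 3 / (n * n - 1) * (1 + (n - 2) * (n - 3) / (3 * n * (n - 1) ** 2) * g2x * g2y)
    return [0.0, 1 / (n - 1), e3, e4]

# Hypothetical data, chosen only for illustration.
x = [0.3, 1.1, 2.0, 4.5, 7.2]
y = [1.0, 0.2, 3.3, 2.8, 9.1]
print(permutation_moments(x, y))
print(formula_moments(x, y))
```

The two printed lists should agree to rounding error, whatever values of x and y are supplied (with n at least 4, so that the fourth k-statistic is defined).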


31.20 It may at first seem surprising that the distribution-free permutation
distribution of r, which is used in testing the non-parametric hypothesis (31.9), should
agree so closely with the exact distribution (16.62) which was derived on the hypothesis 
of the independence and normality of x and y. But the reader should observe that 
the adequacy of the approximation to the third and fourth moments of the permuta- 
tion distribution of r depends on the values of the g’s in (31.15): these will tend to be 
small if F(x, y) is near-normal. In fact, we are now observing from the other end, 
so to speak, the phenomenon mentioned in 31.9, namely the robustness of the distribu- 
tion of r when p = 0. 

But if the virtual coincidence of the permutation distribution with the normal- 
theory distribution is not altogether surprising, it is certainly very convenient and 
satisfying, since we may continue to use the normal-theory tables (here of “ Student’s ” t) 
for the distribution-free test of the non-parametric hypothesis of independence. 
Rank tests of independence

31.21 A minor disadvantage of r as a test of independence, briefly mentioned
below (31.11), is that its exact distribution for small values of n (say n = 5 to 10) is
very tedious to enumerate. The reason for this is simply that the exact distribution
of r depends on the actual values of (x, y) observed, and these are, of course, random
variables. Despite the excellence of the approximation to the distribution of r by
(31.18), it is interesting to inquire how this difficulty can be removed; it is also useful
in other contexts, for the approximation to a permutation distribution is not always
quite so good.

The most obvious means of removing the dependence of the permutation distribution
upon the randomly varying observations is to replace the values of (x, y) by
new values (X, Y) (with correlation coefficient R) so determined that the permutation
distribution of R is the same for every sample (although of course R itself will vary
from sample to sample). We thus seek a set of conventional numbers (X, Y) to replace
the observed (x, y). How should these be chosen? (X, Y) must not depend upon
the actual values of (x, y), but evidently must reflect the order relationships between
the observed values of x and y, since we are interested in the interdependence of the
variables. We are thus led to consider functions of the ranks of x and y. We define
the rank of $y_i$ as its position among the order statistics; i.e.

$$\operatorname{rank}\{y_{(i)}\} = i.$$

We are reinforced in our inclination to consider tests based on ranks (otherwise
called "rank order tests" or simply "rank tests") by the fact that the ranks are invariant
under any monotone transformations of the variables. Any such transformation
will also leave the hypothesis of independence (31.9) invariant, and the ranks are
therefore natural quantities to use. We have still not settled which functions of the
ranks are to be used as our numbers (X, Y); the simplest obvious procedure is to use
the ranks themselves, i.e. to replace the observed values x by their ranks among the x's,
and the observed y's by their ranks.


31.22 If we do this, we calculate the correlation coefficient R between n pairs
(X, Y), where $(X_1, X_2, \ldots, X_n)$ is a permutation of the first n natural numbers, and
$(Y_1, Y_2, \ldots, Y_n)$ is another such permutation. In obtaining the permutation distribution
of R, we may hold the Y's fixed and permute the X's as before, since there
are only n! distinct and equiprobable sets of pairings of (X, Y). We may thus without
loss of generality arrange the pairs of any sample so that the ranks Y are in the natural
order 1, 2, ..., n. If the rank X corresponding to the value Y = i is denoted by $X_i$,
we therefore have for the rank correlation coefficient

$$R = \left[\frac{1}{n}\sum_{i=1}^{n} i X_i - \{\tfrac12(n+1)\}^2\right] \Big/ \{\tfrac{1}{12}(n^2-1)\}, \tag{31.19}$$

for the mean of the first n natural numbers is $\tfrac12(n+1)$ and their variance $\tfrac{1}{12}(n^2-1)$.
R is usually called Spearman's rank correlation coefficient, after the eminent psychologist
who first introduced it over fifty years ago as a substitute for ordinary product-moment
correlation; it is usually given the symbol $r_s$, which we shall now use for it.
Since

$$\sum_{i=1}^{n} i X_i = \tfrac16 n(n+1)(2n+1) - \tfrac12\sum_{i=1}^{n} (X_i - i)^2, \tag{31.20}$$

$r_s$ may alternatively be defined by

$$r_s = 1 - \frac{6}{n(n^2-1)}\sum_{i=1}^{n} (X_i - i)^2, \tag{31.21}$$

which is usually more convenient for calculation.
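
As a small illustration of the two equivalent definitions, the following sketch computes the rank correlation both as a correlation of ranks, in the manner of (31.19), and from the sum of squared rank differences, in the manner of (31.21). The function names and data are hypothetical, and the helper `ranks` assumes there are no tied observations.

```python
def ranks(v):
    """Ranks 1..n of the values in v (assumes no ties)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for position, i in enumerate(order, start=1):
        r[i] = position
    return r

def spearman_from_products(x, y):
    """Correlation of the ranks: mean n(n+1)/2 / n, variance (n^2 - 1)/12."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    mean_product = sum(a * b for a, b in zip(rx, ry)) / n
    return (mean_product - ((n + 1) / 2) ** 2) / ((n * n - 1) / 12)

def spearman_from_differences(x, y):
    """1 - 6 * sum(d^2) / (n (n^2 - 1)), d being the rank differences."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical data, chosen only for illustration.
x = [3.1, 0.4, 2.2, 5.9, 4.0]
y = [2.0, 1.1, 0.3, 6.5, 3.7]
print(spearman_from_products(x, y), spearman_from_differences(x, y))
```

When the pairs are arranged so that the y-ranks are in natural order, the rank differences are exactly the quantities $X_i - i$ of the text.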


31.23 Since the formulae (31.12-14) for the exact moments of r hold for arbitrary
x and y, they hold for $r_s$ defined by (31.21) in particular. Moreover, the natural
numbers have all odd moments about the mean equal to zero by symmetry. This
implies that the exact distribution of $r_s$ is symmetrical and hence its odd moments are
zero. If we substitute also for $k_2$, $k_4$ in (31.14), we obtain for the exact moments

$$\begin{aligned}
E(r_s) &= E(r_s^3) = 0, \\
\operatorname{var} r_s &= \frac{1}{n-1}, \\
E(r_s^4) &= \frac{3}{n^2-1}\left\{1 + \frac{12(n-2)(n-3)}{25n(n-1)^2}\right\}.
\end{aligned} \tag{31.22}$$

However, as indicated by the introductory discussion in 31.21, the exact distribution
of $r_s$ can actually be tabulated once for all. Kendall (1962) gives tables of the frequency
function of $\sum (X_i - i)^2$, the random component of $r_s$ in (31.21), for n = 4(1)10. (The
"tail" entries in Kendall's tables are reproduced in the Biometrika Tables.) Beyond
this point, the approximation by (31.18) is adequate for practical purposes, as is shown
by the following table comparing exact and approximate critical values of $r_s$ for test
sizes α = 0.05, 0.01 and n = 10.


Comparison of exact and approximate critical values of $r_s$ for n = 10

Two-sided test     Exact critical values      Approximate critical values
                   (from Kendall (1955))      from (31.18)
α = 0.05           0.648                      0.632
α = 0.01           0.794                      0.765
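
The remark that the exact distribution can be tabulated once for all is easily illustrated for small n. The sketch below, added for illustration with hypothetical function names, enumerates the n! rankings, tabulates the frequency function of the sum of squared rank differences, and confirms with exact rational arithmetic that the variance of $r_s$ is 1/(n−1).

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

def d2_distribution(n):
    """Frequencies of sum (X_i - i)^2 over the n! equiprobable x-rankings."""
    return Counter(sum((x - i) ** 2 for i, x in enumerate(p, start=1))
                   for p in permutations(range(1, n + 1)))

def rs_variance(n):
    """Exact variance of r_s = 1 - 6 S / (n (n^2 - 1)); its mean is zero."""
    freq = d2_distribution(n)
    total = sum(freq.values())
    c = Fraction(6, n * (n * n - 1))
    return sum(f * (1 - c * s) ** 2 for s, f in freq.items()) / total

print(sorted(d2_distribution(4).items()))
print(rs_variance(5), rs_variance(6))
```

Enumeration of this kind grows as n!, which is why tables stop at small n and an approximation is used beyond them.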


31.24 We chose r, from among the possible rank tests of independence on grounds 
of simplicity ; clearly any reasonable measure of the correlation between x and y, based 
on their rank values, will give a test of independence. Daniels (1944) defined a class 
of correlation coefficients which includes the ordinary product-moment correlation 
as well as those based on ranks, and went on (Daniels, 1948) to show that these are all 
essentially coefficients of disarray, in the sense that if a pair of values of y are inter- 
changed to bring them into the same order as the corresponding values of x, the value 
of any coefficient of this class will increase. Let us consider the question of measuring 
disarray among the ranks of x and y. 

Suppose, as in 31.22, that the ranks of y (which are there called Y) are arrayed in
the natural order 1, 2, ..., n and that the corresponding ranks of x are $X_1, X_2, \ldots, X_n$,
a permutation of 1, 2, ..., n. A natural method of measuring the disarray of the
x-ranks, i.e. the extent of their departure from the order 1, 2, ..., n, is to count the
number of inversions of order among them. For example, in the x-ranking 3214 for
n = 4, there are 3 inversions of order, namely 3-2, 3-1, 2-1. The number of such
inversions, which we shall call Q, may range from 0 to $\tfrac12 n(n-1)$, these limits being
reached respectively if the x-ranking is 1, 2, ..., n and n, (n−1), ..., 1. We may
therefore define a coefficient

$$t = 1 - \frac{4Q}{n(n-1)}, \tag{31.23}$$

which is symmetrically distributed on the range (−1, +1) over the n! equiprobable
permutations, and therefore has expectation 0 when (31.9) holds.
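
The counting of inversions and the coefficient (31.23) can be sketched in a few lines. The function names are hypothetical, and the input is assumed to be the x-ranking written in the natural order of the y-ranks, without ties.

```python
def inversions(x_ranks):
    """Number Q of pairs i < j with X_i > X_j."""
    n = len(x_ranks)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if x_ranks[i] > x_ranks[j])

def kendall_t(x_ranks):
    """t = 1 - 4Q / (n (n - 1))."""
    n = len(x_ranks)
    return 1 - 4 * inversions(x_ranks) / (n * (n - 1))

# The example of the text: the x-ranking 3214 has the inversions 3-2, 3-1, 2-1.
print(inversions([3, 2, 1, 4]), kendall_t([3, 2, 1, 4]))
```

This direct count takes of the order of n² comparisons, which is ample for the sample sizes considered here.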


The coefficient (31.23) had been discussed by several early writers (Fechner, Lipps)
around the year 1900 and subsequently by several other writers, notably Lindeberg, in
the 1920's (historical details are given by Kruskal (1958)), but first became widely used
after a series of papers by M. G. Kendall starting in 1938 and consolidated in a monograph
(Kendall, 1962) to which reference should also be made on questions concerning
the use of t and $r_s$ as measures of correlation. Here we are concerned only with their
properties as distribution-free tests of (31.9).


31.25 The distribution of t, or equivalently of the number of inversions Q, over
the n! equiprobable x-rankings is easily established by the use of frequency-generating
functions. Let the frequency function of Q in samples of size n be f(Q, n)/n!. We
may generate the n! x-rankings for sample size n from the (n−1)! for sample size (n−1)
by inserting the new rank "n" in every possible position relative to the existing (n−1).
(Thus, e.g., the 2! rankings for n = 2

    12
    21

become the 3! rankings for n = 3

    312   132   123
    321   231   213.)

In any ranking, the addition to Q brought about by this process is exactly equal to
the number of ranks to the right of the point at which "n" is inserted. Any value
of Q in the n-ranking is thus built up as the sum of n terms, each of which had a different
value of Q in the (n−1)-ranking. This gives the relationship

$$f(Q, n) = f(Q, n-1) + f(Q-1, n-1) + f(Q-2, n-1) + \ldots + f(Q-(n-1), n-1). \tag{31.24}$$
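Now, if f(Q, n) is the coefficient of $\theta^Q$ in a frequency-generating function G(θ, n),
(31.24) implies that

$$\begin{aligned}
G(\theta, n) &= G(\theta, n-1) + \theta\,G(\theta, n-1) + \theta^2 G(\theta, n-1) + \ldots + \theta^{n-1} G(\theta, n-1) \\
&= \frac{\theta^n - 1}{\theta - 1}\, G(\theta, n-1).
\end{aligned} \tag{31.25}$$

Applying (31.25) repeatedly, we find

$$G(\theta, n) = \left(\frac{\theta^n - 1}{\theta - 1}\right)\left(\frac{\theta^{n-1} - 1}{\theta - 1}\right)\cdots\left(\frac{\theta^3 - 1}{\theta - 1}\right) G(\theta, 2), \tag{31.26}$$

and since we see directly that

$$G(\theta, 2) = 1\cdot\theta^0 + 1\cdot\theta^1 = \frac{\theta^2 - 1}{\theta - 1},$$

(31.26) may be written

$$G(\theta, n) = \prod_{s=1}^{n}\left(\frac{\theta^s - 1}{\theta - 1}\right). \tag{31.27}$$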
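
The recurrence (31.24) translates directly into a short computation of the frequencies f(Q, n). The sketch below is illustrative only; the function name is hypothetical. Each pass of the outer loop corresponds to one application of (31.25), i.e. to multiplying the generating function by 1 + θ + ... + θ^(m−1).

```python
def q_frequencies(n):
    """List whose Q-th entry is f(Q, n), for Q = 0, 1, ..., n(n-1)/2."""
    f = [1]  # n = 1: a single ranking with no inversions
    for m in range(2, n + 1):
        new = [0] * (len(f) + m - 1)
        for q, count in enumerate(f):
            # inserting the new rank m adds k = 0, 1, ..., m-1 inversions
            for k in range(m):
                new[q + k] += count
        f = new
    return f

print(q_frequencies(3))  # frequencies of Q = 0, 1, 2, 3 over the 3! rankings
print(q_frequencies(4))
```

Dividing the entries by n! gives the frequency function of Q, and hence of t, without enumerating the rankings themselves.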


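We obtain the characteristic function of Q by inserting the factor $(n!)^{-1}$ and replacing
θ by exp(iθ) in (31.27), so that

$$\phi(\theta) = \{n!\,(e^{i\theta} - 1)^n\}^{-1}\prod_{s=1}^{n}(e^{i\theta s} - 1). \tag{31.28}$$

The c.g.f. of Q is therefore

$$\psi(\theta) = \sum_{s=1}^{n}\log(e^{i\theta s} - 1) - n\log(e^{i\theta} - 1) - \log(n!). \tag{31.29}$$

If we substitute

$$e^{i\theta s} - 1 = 2e^{\frac12 i\theta s}\sinh(\tfrac12 i\theta s)$$

everywhere in (31.29), we reduce it to

$$\begin{aligned}
\psi(\theta) &= \tfrac12 i\theta\left(\sum_{s=1}^{n} s - n\right) + \sum_{s=1}^{n}\log\sinh(\tfrac12 i\theta s) - n\log\sinh(\tfrac12 i\theta) - \log(n!) \\
&= \tfrac14 i\theta\,n(n-1) + \sum_{s=1}^{n}\log\left\{\frac{\sinh(\tfrac12 i\theta s)}{\tfrac12 i\theta s}\right\} - n\log\left\{\frac{\sinh(\tfrac12 i\theta)}{\tfrac12 i\theta}\right\},
\end{aligned} \tag{31.30}$$

and, using (3.61), (31.30) becomes

$$\psi(\theta) = \tfrac14 i\theta\,n(n-1) + \sum_{j=1}^{\infty}\frac{B_{2j}\,(i\theta)^{2j}}{2j\,(2j)!}\left(\sum_{s=1}^{n} s^{2j} - n\right), \tag{31.31}$$

where the $B_{2j}$ are the (non-zero) even-order Bernoulli numbers defined in 3.25.
Picking out the coefficients of $(i\theta)^{2j}/(2j)!$ in (31.31) we have, for the cumulants
of Q,

$$\kappa_1 = \tfrac14 n(n-1), \qquad \kappa_{2j+1} = 0, \quad j \ge 1, \qquad \kappa_{2j} = \frac{B_{2j}}{2j}\left(\sum_{s=1}^{n} s^{2j} - n\right), \quad j \ge 1. \tag{31.32}$$

From (31.23), this gives for the cumulants of the rank correlation statistic t itself

$$\kappa_{2j+1} = 0, \quad j \ge 0, \qquad \kappa_{2j} = \frac{2^{4j-1}B_{2j}}{j\,\{n(n-1)\}^{2j}}\left(\sum_{s=1}^{n} s^{2j} - n\right), \quad j \ge 1. \tag{31.33}$$

Thus, t is symmetrically distributed about zero and

$$\operatorname{var} t = \frac{4}{3\{n(n-1)\}^2}\left\{\tfrac16 n(n+1)(2n+1) - n\right\} = \frac{2(2n+5)}{9n(n-1)}. \tag{31.34}$$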
31.26 Further, (31.33) shows that $\kappa_{2j}$ is of order $n^{-4j}\sum_{s=1}^{n} s^{2j}$ in n. Since the
summation is of order $n^{2j+1}$, this means that

$$\kappa_{2j} = O(n^{1-2j})$$

and hence the standardized cumulants

$$\kappa_{2j}/(\kappa_2)^j = O(n^{1-j}).$$

Thus

$$\lim_{n\to\infty}\kappa_{2j}/(\kappa_2)^j = 0, \quad j > 1, \tag{31.35}$$

and hence the distribution of t tends to normality with mean zero and variance given
by (31.34). The tendency to normality is extremely rapid. Kendall (1962) gives
the exact distribution function (generated from (31.24)) for n = 4(1)10. Beyond this
point, the asymptotic normal distribution may be used with little loss of accuracy.
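
The quality of the normal approximation at n = 10 can be examined by comparing an exact tail probability of Q, built from the recurrence (31.24), with its normal counterpart. This is a sketch under stated assumptions: the function names are hypothetical, and the continuity correction of one half is our own choice, not something prescribed in the text. The variance of Q used below is the value n(n−1)(2n+5)/72 implied by (31.34) and (31.23).

```python
from fractions import Fraction
from math import sqrt
from statistics import NormalDist

def q_frequencies(n):
    """f(Q, n) for Q = 0..n(n-1)/2, from the recurrence (31.24)."""
    f = [1]
    for m in range(2, n + 1):
        new = [0] * (len(f) + m - 1)
        for q, count in enumerate(f):
            for k in range(m):
                new[q + k] += count
        f = new
    return f

def exact_variance(n):
    """Exact variance of Q over the n! equiprobable rankings."""
    f = q_frequencies(n)
    total = sum(f)
    mean = Fraction(sum(q * c for q, c in enumerate(f)), total)
    return sum(c * (q - mean) ** 2 for q, c in enumerate(f)) / total

def exact_upper_tail(n, q0):
    """P(Q >= q0) under the hypothesis of independence."""
    f = q_frequencies(n)
    return sum(f[q0:]) / sum(f)

def normal_upper_tail(n, q0):
    """Normal approximation to P(Q >= q0), with an assumed continuity correction."""
    mean = n * (n - 1) / 4
    sd = sqrt(n * (n - 1) * (2 * n + 5) / 72)
    return 1 - NormalDist(mean, sd).cdf(q0 - 0.5)

print(exact_variance(10))
print(exact_upper_tail(10, 32), normal_upper_tail(10, 32))
```

The same functions may be used to examine how the agreement varies with n and with the tail point chosen.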


31.27 In 31.24 we arrived at the coefficient t by way of the realization that the
number of inversions Q is a natural measure of the disarray of the x-ranking. If one
thinks a little further about this, it seems reasonable to weight inversions unequally;
e.g. in the x-ranking 24351, one feels that the inversion 5-1 ought to carry more weight,
because it is a more extreme departure from the natural order 1, 2, ..., n, than the
inversion 4-3. A simple weighting which suggests itself is the distance apart of the
ranks inverted; in the immediately preceding instance, this would give weights of
4 and 1 respectively to the two inversions. Thus, if we define

$$h_{ij} = \begin{cases} +1 & \text{if } X_i > X_j, \\ 0 & \text{otherwise,} \end{cases} \tag{31.36}$$

we now seek to use the weighted sum of inversions

$$V = \sum_{i<j} h_{ij}\,(j - i) \tag{31.37}$$

instead of our previous sum of inversions

$$Q = \sum_{i<j} h_{ij}. \tag{31.38}$$

However, use of (31.37) leads us straight back to $r_s$. We leave it to the reader to
prove in Exercise 31.5 that

$$V = \tfrac12\sum_{i=1}^{n}(X_i - i)^2, \tag{31.39}$$

so that, from (31.21),

$$r_s = 1 - \frac{12V}{n(n^2-1)}, \tag{31.40}$$

which is a definition of $r_s$ analogous to (31.23) for t.
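
The identity (31.39) is easily checked by enumeration for small n. In the sketch below (hypothetical function names), V is computed with the weight j − i of (31.37), and also with the weight $X_i - X_j$, the distance apart of the two ranks inverted as in the verbal example of the text. That the two weightings always give the same total is our own observation, verified here by enumeration rather than taken from the source.

```python
from itertools import permutations

def v_position_weights(x_ranks):
    """V with weight j - i for each inversion (i < j, X_i > X_j)."""
    n = len(x_ranks)
    return sum(j - i for i in range(n) for j in range(i + 1, n)
               if x_ranks[i] > x_ranks[j])

def v_rank_weights(x_ranks):
    """Total with weight X_i - X_j, the distance apart of the ranks inverted."""
    n = len(x_ranks)
    return sum(x_ranks[i] - x_ranks[j] for i in range(n) for j in range(i + 1, n)
               if x_ranks[i] > x_ranks[j])

def sum_d2(x_ranks):
    """Sum of (X_i - i)^2 with i running from 1 to n."""
    return sum((x - i) ** 2 for i, x in enumerate(x_ranks, start=1))

def identity_holds(n):
    """Check 2V = sum (X_i - i)^2, for both weightings, over all n! rankings."""
    return all(2 * v_position_weights(p) == sum_d2(p) == 2 * v_rank_weights(p)
               for p in permutations(range(1, n + 1)))

def spearman_from_v(x_ranks):
    """r_s = 1 - 12 V / (n (n^2 - 1))."""
    n = len(x_ranks)
    return 1 - 12 * v_position_weights(x_ranks) / (n * (n * n - 1))

example = [2, 4, 3, 5, 1]  # the x-ranking 24351 of the text
print(v_position_weights(example), sum_d2(example), spearman_from_v(example))
print(identity_holds(5))
```

For the ranking 24351 the five inversions give V = 11, half the sum of squared rank differences, 22.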


31.28 Despite the apparently very different methods they use of weighting inversions,
it is a remarkable fact that Q and V of (31.37-38), and hence the statistics t
and $r_s$ also, are very highly correlated when the hypothesis of independence (31.9)
holds; the reader is left to obtain the actual value of their correlation coefficient in
Exercise 31.6. It declines from 1 at n = 2 (when t and $r_s$ are equivalent) to its minimum
value of 0.98 at n = 5, and then increases towards 1 as n → ∞. Thus the tests
are asymptotically equivalent when $H_0$ holds, and this, together with the result of
25.13, implies that, from the standpoint of asymptotic relative efficiency, both tests
possess the same properties. Daniels (1944) showed that the limiting joint distribution
of t and $r_s$ when $H_0$ holds is bivariate normal.
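
The behaviour of this correlation for small n can be seen by direct enumeration. The sketch below (hypothetical function name) computes the correlation between Q and V over the n! equiprobable rankings; since t and $r_s$ are decreasing linear functions of Q and V respectively, it is also the correlation between t and $r_s$. It is offered as a numerical illustration, not as a solution of the exercise.

```python
from itertools import permutations
from math import sqrt

def qv_correlation(n):
    """Correlation of Q and V over all n! rankings under independence."""
    qs, vs = [], []
    for p in permutations(range(1, n + 1)):
        q = v = 0
        for i in range(n):
            for j in range(i + 1, n):
                if p[i] > p[j]:
                    q += 1       # unweighted inversion count
                    v += j - i   # weighted inversion count
        qs.append(q)
        vs.append(v)
    m = len(qs)
    mq, mv = sum(qs) / m, sum(vs) / m
    cov = sum((a - mq) * (b - mv) for a, b in zip(qs, vs))
    return cov / sqrt(sum((a - mq) ** 2 for a in qs) * sum((b - mv) ** 2 for b in vs))

for n in range(2, 7):
    print(n, round(qv_correlation(n), 4))
```

The printed values fall from 1 at n = 2 to about 0.98 and begin to rise again after n = 5, in line with the statement above.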


31.29 In samples from a bivariate normal population, the high correlation between
t and $r_s$ persists even when the parent correlation coefficient ρ ≠ 0; S. T. David et al.
(1951) show that as n → ∞, t and $r_s$ have a correlation which tends to a value > 0.984 if |ρ|
< 0.8, and to 0.937 when ρ = 0.9.

Hoeffding (1948a) showed that t and $r_s$ are quite generally asymptotically distributed
in the bivariate normal form, but that their correlation coefficient depends strongly on
the parent bivariate distribution and may indeed be zero.


The efficiencies of tests of independence 


31.30 We now examine the asymptotic relative efficiencies (ARE) of the three
tests of independence so far considered, relative to the ordinary sample correlation
coefficient r, when the alternative hypothesis is that of bivariate normality as at (31.10).
By the methods of 23.27-36, we see that r gives a UMPU test of ρ = 0 against one-sided
and two-sided alternatives; the reader is asked to verify this in Exercise 31.21.
Since by 31.19 the permutation test based on r is asymptotically equivalent to the
normal-theory r-test for independence, we see that its ARE will be 1 compared to
that test.


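31.31 We now derive the ARE of the test based on t defined at (31.23). From
the definition at (31.36) we see that

$$h_{ij} = \tfrac12\{1 - \operatorname{sgn}(x_i - x_j)\operatorname{sgn}(y_i - y_j)\},$$

and since there are $\tfrac12 n(n-1)$ terms in $Q = \sum_{i<j} h_{ij}$, we have for their mean

$$E\left\{\frac{Q}{\tfrac12 n(n-1)}\right\} = E(h_{ij}) = \tfrac12\{1 - E[\operatorname{sgn}(x_i - x_j)\operatorname{sgn}(y_i - y_j)]\}, \tag{31.41}$$

which from (31.23) gives

$$E(t) = 1 - 2E\left\{\frac{Q}{\tfrac12 n(n-1)}\right\} = E[\operatorname{sgn}(x_i - x_j)\operatorname{sgn}(y_i - y_j)].$$

Now if the parent distribution F of x and y is bivariate normal with correlation parameter
ρ, so is that of $w = (x_i - x_j)$ and $z = (y_i - y_j)$. Thus

$$E(t) = E[\operatorname{sgn}(x_i - x_j)\operatorname{sgn}(y_i - y_j)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\operatorname{sgn} w\,\operatorname{sgn} z\,dF,$$

which on applying (4.8) becomes

$$E(t) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\left\{\frac{1}{\pi}\int_{-\infty}^{\infty}\frac{\sin t_1 w}{t_1}\,dt_1\right\}\left\{\frac{1}{\pi}\int_{-\infty}^{\infty}\frac{\sin t_2 z}{t_2}\,dt_2\right\}dF,$$

which may be rewritten

$$E(t) = -\frac{1}{\pi^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\left\{\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\exp(it_1 w + it_2 z)\,dF\right\}\frac{dt_1\,dt_2}{t_1 t_2}. \tag{31.42}$$

The inner double integral in (31.42) is the c.f. of F, which is

$$\phi(t_1, t_2) = \exp\{-\tfrac12(t_1^2 + t_2^2 + 2\rho t_1 t_2)\}.$$

If we insert this and differentiate the remaining double integral with respect to ρ, we
find

$$\frac{\partial}{\partial\rho}E(t) = \frac{1}{\pi^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\exp\{-\tfrac12(t_1^2 + t_2^2 + 2\rho t_1 t_2)\}\,dt_1\,dt_2. \tag{31.43}$$

But the double integral on the right of (31.43) is simply evaluated as

$$\int_{-\infty}^{\infty}\exp\{-\tfrac12 t_2^2(1-\rho^2)\}\left[\int_{-\infty}^{\infty}\exp\{-\tfrac12(t_1 + \rho t_2)^2\}\,dt_1\right]dt_2 = 2\pi/(1-\rho^2)^{1/2}.$$

Thus (31.43) becomes

$$\frac{\partial}{\partial\rho}E(t) = \frac{2}{\pi(1-\rho^2)^{1/2}},$$

so that

$$\left[\frac{\partial}{\partial\rho}E(t)\right]_{\rho=0} = \frac{2}{\pi}. \tag{31.44}$$

Also, from (31.34),

$$\operatorname{var}(t \mid H_0) \sim \frac{4}{9n}, \tag{31.45}$$

while for the ordinary correlation coefficient r, from (26.31),

$$\left[\frac{\partial}{\partial\rho}E(r)\right]_{\rho=0} = 1 \tag{31.46}$$

and from 31.19

$$\operatorname{var} r \sim \frac{1}{n}. \tag{31.47}$$

Using (25.27) with m = 1, δ = ½, the results (31.44-47) give, for the ARE of t compared
to r,

$$A_{t,r} = 9/\pi^2. \tag{31.48}$$

By the remark of 31.28, (31.48) will hold also for the ARE of $r_s$ compared to r, a result
due originally to Hotelling and Pabst (1936).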
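
The arithmetic behind (31.48) may be laid out explicitly. The display below is a sketch added for illustration. It assumes that (25.27) with m = 1, δ = ½ reduces to the ratio, for the two tests, of the squared slope of the expectation at ρ = 0 to the variance under the null hypothesis; it also uses the standard integral of $(1-u^2)^{-1/2}$, which is not quoted in the text, together with E(t) = 0 at ρ = 0.

$$E(t) = \int_0^{\rho}\frac{2}{\pi(1-u^2)^{1/2}}\,du = \frac{2}{\pi}\arcsin\rho,$$

$$A_{t,r} = \lim_{n\to\infty}\frac{(2/\pi)^2\big/\{4/(9n)\}}{1^2\big/(1/n)} = \frac{9n/\pi^2}{n} = \frac{9}{\pi^2} \approx 0.912.$$

On this reading, the rank tests lose under a tenth of the efficiency of r even when the normal-theory assumptions hold.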


31.32 Apart from the results of 31.30-31 against bivariate normal alternatives, little
work has been done on the efficiencies of tests of independence, largely because of the
difficulty of specifying non-normal alternatives to independence. A notable exception is the
paper by Konijn (1956), which considers a class of alternatives to independence generated
by linear transformations of two independent variables. He finds, as above, that t and $r_s$
are often asymptotically equivalent tests, each having ARE close to that of the test based
on the sample correlation coefficient r, equal to it or even (in case of an underlying
double-exponential distribution) exceeding it.
31.33 A defect of all the tests we have considered is that they will not be consistent
tests against every departure from the hypothesis of independence (31.9). To see this
we need only remark that each is essentially based on a correlation coefficient of some
kind, whose distribution will be free of location and scale parameters but will depend
on the population correlation coefficient ρ. For departures from independence implying
ρ ≠ 0, these tests will be consistent. But it is perfectly possible in non-normal cases to
have non-independence accompanied by ρ = 0 (cf. 26.6), and we cannot expect our tests
to be consistent against such alternatives. With this in mind, Hoeffding (1948b) proposed
another distribution-free test of (31.9) which is consistent against any continuous
alternative bivariate distribution with continuous marginal distributions. Hoeffding
tabulates the distribution of his statistic for n = 5, 6, 7 and obtains its limiting c.f. and its
cumulants. (The limiting d.f. is given by Blum et al. (1961).) He also proves that
against this class of alternatives no rank test of independence exists which is unbiassed
for every test size α = M/n!. However, if randomization is permitted in the test function,
Lehmann (1951) shows that generally unbiassed rank tests of independence do exist.


Tests of randomness against trend alternatives 
31.34 As we remarked in 31.14, problem (3) of 31.13, which is to test

$$H_0 : F_1(x) = F_2(x) = \ldots = F_n(x), \quad \text{all } x, \tag{31.49}$$

where we have an observation from each of n continuous distributions ordered according
to the value of some variable y, is equivalent to testing the independence of the x's
and the y's. Thus any of our tests of independence may be used as a test of randomness.
However, since the y-variable is not usually a random variable but merely
a labelling of the distributions (through time or otherwise), any monotone transformation
of y would do as well as y itself. It is therefore natural to confine our attention
to rank tests of randomness, since the ranks are invariant under monotone transformation,
which leaves the hypothesis (31.49) unchanged.

Mann (1945) seems to have been the first to recognize that a rank correlation
statistic could be used to test randomness as well as independence and proposed the
use of t (although of course $r_s$ could be used just as well) against the class of alternatives

$$H_1 : F_1(x) < F_2(x) < \ldots < F_n(x), \quad \text{all } x, \tag{31.50}$$

where the observations $x_i$ remain independent.

Since (31.50) states that the probability of an observation falling below any fixed
value increases monotonically as we pass along the sequence of n observations, it may
be described as a downward trend alternative. The critical region for a size-α test
therefore consists of the 100α per cent largest values of Q, the number of inversions
defined at (31.38).


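31.35 (31.50) implies that for i < j

$$P\{h_{ij} = 1\} = P\{X_i > X_j\} = \tfrac12 + \varepsilon_{ij}, \quad 0 < \varepsilon_{ij} \le \tfrac12. \tag{31.51}$$

We thus have, from (31.38),

$$E(Q \mid H_1) = \tfrac14 n(n-1) + \sum_{i<j}\varepsilon_{ij} = \tfrac14 n(n-1) + S_n, \tag{31.52}$$

where $S_n$ is the sum of the $\tfrac12 n(n-1)$ values $\varepsilon_{ij}$.
Now consider the variance of Q.

$$\operatorname{var}(Q \mid H_1) = \operatorname{var}\left\{\sum_{i<j} h_{ij}\right\} = \sum_{i<j}\operatorname{var}(h_{ij}) + \sum_{i<j}\sum_{k<l}\operatorname{cov}(h_{ij}, h_{kl}). \tag{31.53}$$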
The covariance terms in (31.53) are of two kinds. Those involving four distinct
suffixes are all zero, since the variables are then independent, and there are $\binom{n}{4}$ such
terms. The remaining terms are non-zero and involve three distinct suffixes only,
$(i,j)$ and $(k,l)$ having one suffix in common. The number of such terms is of order
$\binom{n}{3}$, the number of ways of selecting three suffixes from $n$. Since there are only $\binom{n}{2}$
terms in the first summation of (31.53), we may therefore write

$$\operatorname{var}(Q \mid H_1) = O(n^3). \tag{31.54}$$

31.26 shows that $Q$ is asymptotically normally distributed when $H_0$ holds, and thus
the critical region of the test consists asymptotically of the values of $Q$ exceeding the
value

$$Q_0 = \tfrac{1}{4}n(n-1) + d_\alpha \{\tfrac{1}{72}n(n-1)(2n+5)\}^{1/2}, \tag{31.55}$$

where the term in braces in (31.55) is the variance of $Q$ (obtained from (31.34) and
(31.23)) and $d_\alpha$ is the appropriate standardized normal deviate.
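
As an illustration of the procedure just described, the following Python sketch counts the inversions $Q$ of (31.38) and evaluates the asymptotic critical value (31.55). The function names and the data series are hypothetical, chosen only for this example, and the observations are assumed to be free of ties.

```python
from math import sqrt
from statistics import NormalDist


def inversions(x):
    """Q: number of pairs i < j with x[i] > x[j] (no ties assumed)."""
    n = len(x)
    return sum(1 for i in range(n) for j in range(i + 1, n) if x[i] > x[j])


def mann_critical_value(n, alpha):
    """Asymptotic critical value Q_0 of (31.55) for a size-alpha test."""
    d_alpha = NormalDist().inv_cdf(1 - alpha)   # upper-alpha standard normal deviate
    mean_q = n * (n - 1) / 4                    # E(Q | H_0)
    var_q = n * (n - 1) * (2 * n + 5) / 72      # var(Q | H_0)
    return mean_q + d_alpha * sqrt(var_q)


# Made-up series with an apparent downward trend.
series = [9.1, 8.7, 9.4, 7.9, 8.2, 7.5, 7.8, 6.9, 7.1, 6.2]
q = inversions(series)
q0 = mann_critical_value(len(series), alpha=0.05)
print(q, round(q0, 2), q > q0)
```

For so short a series the normal approximation is rough; the exact null distribution of $Q$ would normally be preferred when $n$ is small.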


31.36 From (31.52) and (31.55), we see that

$$P\{Q > Q_0 \mid H_1\} = P\{Q - E(Q \mid H_1) > d_\alpha[\tfrac{1}{72}n(n-1)(2n+5)]^{1/2} - S_n \mid H_1\}. \tag{31.56}$$

Using (31.54), we may write (31.56) asymptotically as

$$P\{Q > Q_0 \mid H_1\} \sim P\{Q - E(Q \mid H_1) > [\operatorname{var}(Q \mid H_1)]^{1/2}[d_\alpha - c\,n^{-3/2}S_n]\}, \tag{31.57}$$

where $c$ is some constant. We now impose the condition that, as $n \to \infty$,

$$n^{-3/2}S_n \to \infty, \tag{31.58}$$

so that

$$\lambda = d_\alpha - c\,n^{-3/2}S_n \to -\infty \tag{31.59}$$

and $\lambda$ will be negative when $n$ is large enough. By Tchebycheff's inequality (3.94),
we have a fortiori for negative $\lambda$ and any random variable $x$,

$$P\{x - E(x) > \lambda(\operatorname{var} x)^{1/2}\} \ge 1 - \frac{1}{\lambda^2}. \tag{31.60}$$

Thus, when (31.58) holds, (31.57), (31.59) and (31.60) give

$$\lim_{n \to \infty} P\{Q > Q_0 \mid H_1\} = 1.$$

Thus the test of randomness is consistent provided that (31.58) holds. This is a
rather mild requirement, for there are $\tfrac{1}{2}n(n-1)$ terms in $S_n$. Thus if there is a fixed
non-zero lower bound to the $\varepsilon_{ij}$, (31.58) certainly holds. Commonly, one wishes to
consider alternatives for which $\varepsilon_{ij}$ is a function of the distance $|i-j|$ only; if it is
an increasing function of this distance, (31.58) certainly holds.

As well as deriving a more general version of this result, in which the $\varepsilon_{ij}$ need not
all have the same sign, Mann (1945) derived a condition for the unbiassedness of the
test, which is essentially that given as Exercise 31.8.


31.37 We now consider a particular trend alternative to randomness, where the
mean of the variable $x_i$ is a linear function of $i$, and its distribution is normal about
that mean with constant variance. This is the ordinary linear regression model with
normal errors. We have

$$x_i = \beta_0 + \beta_1 i + \delta_i, \tag{31.61}$$

where the errors $\delta_i$ are independently normally distributed, with zero mean and variance $\sigma^2$ for all $i$.
The test of randomness is equivalent to testing

$$H_0 : \beta_1 = 0 \tag{31.62}$$

in (31.61). We proceed to find the asymptotic relative efficiency (ARE) of the test
based on $t$ (or, equivalently, on $Q$) compared with the standard test, based on the
sample regression coefficient

$$b = \frac{\sum (x_i - \bar{x})(i - \bar{i})}{\sum (i - \bar{i})^2} = \frac{\sum x_i\{i - \tfrac{1}{2}(n+1)\}}{\tfrac{1}{12}n(n^2-1)}, \tag{31.63}$$

which is the LR test for (31.62) and (since there is only one constraint imposed by $H_0$)
is UMP for one-sided alternatives, say $\beta_1 < 0$, and UMPU for two-sided alternatives
$\beta_1 \ne 0$ (cf. 24.27). We put $\sigma^2 = 1$ without loss of generality. We have, from Least
Squares theory (cf. Examples 19.3, 19.6)

$$E(b \mid H_1) = \beta_1, \qquad \operatorname{var}(b \mid H_1) = \frac{1}{\sum (i - \bar{i})^2} = \frac{12}{n(n^2-1)},$$

so that the ratio

$$\frac{\left\{\left[\dfrac{\partial E(b \mid H_1)}{\partial \beta_1}\right]_{\beta_1=0}\right\}^2}{\operatorname{var}(b \mid H_0)} = \frac{n(n^2-1)}{12} \sim \frac{n^3}{12}. \tag{31.64}$$


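31.38 To obtain the equivalent of (31.64) for the test based on $t$, we require the
derivative of

$$E(Q \mid H_1) = E\Big\{\sum_{i<j} h_{ij}\Big\} = \sum_{i<j} E(h_{ij}). \tag{31.65}$$

Now $(x_i - x_j)$ is, from the model (31.61), normally distributed with mean $\beta_1(i-j)$ and
variance 2. Hence

$$E(h_{ij}) = P\{h_{ij} = 1\} = P\{x_i > x_j\} = \int_0^\infty (4\pi)^{-1/2} \exp\{-\tfrac{1}{4}[u - \beta_1(i-j)]^2\}\,du = \int_{-\beta_1(i-j)/2^{1/2}}^{\infty} (2\pi)^{-1/2} \exp(-\tfrac{1}{2}u^2)\,du.$$

Thus

$$\left[\frac{\partial E(h_{ij})}{\partial \beta_1}\right]_{\beta_1=0} = \frac{i-j}{2\pi^{1/2}}. \tag{31.66}$$

From (31.65) and (31.66)

$$\left[\frac{\partial E(Q \mid H_1)}{\partial \beta_1}\right]_{\beta_1=0} = \frac{1}{2\pi^{1/2}} \sum_{i<j} (i-j) = -\frac{1}{2\pi^{1/2}} \cdot \frac{n(n^2-1)}{6} = -\frac{n(n^2-1)}{12\pi^{1/2}}. \tag{31.67}$$

Also, from (31.34) and (31.23)

$$\operatorname{var}(Q \mid H_0) = \tfrac{1}{72}n(n-1)(2n+5). \tag{31.68}$$

From (31.67) and (31.68)

$$\frac{\left\{\left[\dfrac{\partial E(Q \mid H_1)}{\partial \beta_1}\right]_{\beta_1=0}\right\}^2}{\operatorname{var}(Q \mid H_0)} = \frac{\dfrac{1}{144\pi}n^2(n^2-1)^2}{\tfrac{1}{72}n(n-1)(2n+5)} \sim \frac{n^3}{4\pi}. \tag{31.69}$$

Use of (31.64) and (31.69) in (25.27) with $m = 1$, $\delta = \tfrac{3}{2}$, gives for the ARE of $Q$ compared
to $b$

$$A_{Q,b} = \left(\frac{3}{\pi}\right)^{1/3} \doteq 0.98. \tag{31.70}$$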

Just as before (cf. 31.28), the same result holds for the alternative coefficient $r_s$
(or equivalently $V$); the direct evaluation of the ARE of $V$ is left to the reader as
Exercise 31.9.
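
The limiting value in (31.70) can be checked numerically. The sketch below sums the derivative (31.66) directly over all pairs $i < j$, forms the ratios (31.64) and (31.69) for finite $n$, and combines them on the assumption that (25.27), which is not reproduced in this section, takes the ratio of derivative to standard deviation and raises it to the power $1/(m\delta)$. The function names are hypothetical.

```python
from math import pi, sqrt


def efficacy_b(n):
    """Ratio (31.64) for the regression coefficient b, with sigma^2 = 1."""
    return n * (n * n - 1) / 12


def efficacy_q(n):
    """Ratio (31.69) for Q, with the derivative summed term by term from (31.66)."""
    deriv = sum((i - j) / (2 * sqrt(pi))
                for i in range(1, n + 1) for j in range(i + 1, n + 1))
    var_q = n * (n - 1) * (2 * n + 5) / 72      # (31.68)
    return deriv ** 2 / var_q


def are_finite_n(n, m=1, delta=1.5):
    """Finite-n analogue of the ARE, using the assumed form of (25.27)."""
    ratio = sqrt(efficacy_q(n) / efficacy_b(n))
    return ratio ** (1 / (m * delta))


for n in (10, 100, 400):
    print(n, round(are_finite_n(n), 4))
print("limit", round((3 / pi) ** (1 / 3), 4))
```

The finite-$n$ values increase towards the limit, since the ratio of (31.69) to (31.64) reduces to $6(n+1)/\{\pi(2n+5)\}$.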


Optimum rank tests of independence and of randomness 

31.39 It is worth remarking that the two rank correlation coefficients $t$ and $r_s$ are
even more efficient as tests of randomness against normal alternatives than as tests of
bivariate independence against normal alternatives, the values of ARE given by (31.70)
and (31.48) being $(3/\pi)^{1/3}$ and $(3/\pi)^2$ respectively. But although both of these values
are near 1, they are not equal to 1, and we are left with the question whether distribution-free
tests exist for these problems which have ARE of 1 compared with the best test.

In order to answer this question, let us return to our discussion of 31.21, where
the choice of $r_s$ from among all possible rank tests was made on grounds of simplicity.
In effect, we decided to replace the observed variate-values $x$ by their ranks. Now
since the permutation test based on the variate-values themselves has ARE 1 against
normal alternatives (cf. 31.30), we should expect to retain optimum efficiency if we
replace the variate-values by functions of their ranks which, asymptotically, are perfectly
correlated with the variate-values. Suppose, then, that after ranking the $x$ observations,
we replace them by the expected values of the order statistics in a sample of
size $n$ from a standardized normal distribution. These are a perfectly definite set of
conventional numbers, usually called the normal scores; the point in using them is that
as $n \to \infty$, the correlation of these numbers with the variate-values will tend to 1, and
we shall obtain optimum rank tests against normal alternatives. The test statistic is
therefore

$$\frac{\dfrac{1}{n}\sum_{i=1}^{n} i\,E(X_i,n) - \tfrac{1}{2}(n+1)\cdot\dfrac{1}{n}\sum_{i=1}^{n} E(X_i,n)}{\left[\tfrac{1}{12}(n^2-1)\cdot\dfrac{1}{n}\sum_{i=1}^{n}\{E(X_i,n) - \bar{E}(X_i,n)\}^2\right]^{1/2}}, \tag{31.71}$$

where $X_i$ is now the rank of the $x$-value corresponding to the $i$th largest value of $y$ and $E(s,n)$
is the expected value of $x_{(s)}$ in a sample of size $n$ from a standardized normal distribution.
Neglecting constants, (31.71) is equivalent to testing with the statistic

$$\sum_{i=1}^{n} i\,E(X_i,n), \tag{31.72}$$

which therefore has ARE of 1 in testing independence or randomness against normal
alternatives.
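
A minimal sketch of the normal scores statistic (31.72) follows. It approximates $E(s,n)$ by trapezoidal integration of the order-statistic density rather than by consulting tables, and it assumes that the observations are supplied in increasing order of $y$, with ranks counted from the smallest $x$ and no ties. All function names are hypothetical.

```python
from math import comb, exp, pi, sqrt
from statistics import NormalDist


def normal_score(s, n, steps=4000, lim=8.0):
    """Approximate E(s, n): mean of the s-th smallest of n standard normal deviates."""
    nd = NormalDist()
    const = s * comb(n, s)                      # n! / ((s-1)! (n-s)!)
    h = 2 * lim / steps
    total = 0.0
    for k in range(steps + 1):
        x = -lim + k * h
        p = nd.cdf(x)
        density = exp(-x * x / 2) / sqrt(2 * pi)
        f = x * p ** (s - 1) * (1 - p) ** (n - s) * density
        total += f * (0.5 if k in (0, steps) else 1.0)   # trapezoidal end weights
    return const * total * h


def normal_scores_statistic(x_in_y_order):
    """Statistic (31.72): sum over i of i * E(X_i, n), X_i the rank of the i-th x."""
    n = len(x_in_y_order)
    order = sorted(range(n), key=lambda i: x_in_y_order[i])
    rank = [0] * n
    for r, idx in enumerate(order, start=1):
        rank[idx] = r
    scores = [normal_score(s, n) for s in range(1, n + 1)]
    return sum((i + 1) * scores[rank[i] - 1] for i in range(n))


print(round(normal_score(2, 2), 4))             # mean of the larger of two deviates
print(round(normal_scores_statistic([1.0, 2.0, 3.0, 4.0]), 4))
```

Because the normal scores are symmetric about zero, reversing the direction of ranking merely changes the sign of the statistic.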
Bhuchongkul (1964) confirms this result in the course of investigating the use of
any conventional numbers in the test statistic.

The use of the normal scores as conventional numbers was first suggested by R. A.
Fisher and F. Yates in the Introduction to their Statistical Tables, first published in
1938. The locally optimum properties of the test statistic (31.72) were demonstrated
by Hoeffding (1950) and Terry (1952). A direct proof of the asymptotically perfect
correlation between the expected values of the order statistics and the variate-values
they replace is obtained from Hoeffding's (1953) theorem to the effect that for any parent
d.f. $F(x)$ with finite mean, and any real continuous function $g(x)$ bounded in value by
an integrable convex function,

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} g\{E(m,n)\} = \int_{-\infty}^{\infty} g(x)\,dF. \tag{31.73}$$

Successive substitution of $g(x) = \cos xt$, $g(x) = \sin xt$ in (31.73) shows that the limiting
c.f. of the $E(m,n)$ is the c.f. of the distribution $F(x)$, which is $E\{\cos xt + i \sin xt\}$.

Bell and Doksum (1965) show that if instead of the normal scores $E(s,n)$ we use
simply the observed $x_{(s)}$ in a sample of $n$ random normal deviates, we get the same asymptotic
properties in all the contexts considered in this chapter. The advantages are that
no special table is needed, and that exact size $\alpha$ can be attained for tests; the disadvantage
is that the small-sample power of these tests seems to be lower than for normal scores
tests; Jogdeo (1966) shows that these tests have some undesirable properties.

Brillinger (1966) points out that a set of ordered values $x_{(s)}$ are maximally correlated
with the values $z_{(s)} = a + bE(x_{(s)})$ for any sample size. This follows from the fact that
the correlation coefficient between the $x_{(s)}$ and $z_{(s)}$ equals the correlation ratio of $x$ on its
ordered values (cf. (26.40) and (26.45)).


31.40 As well as seeking optimum rank tests against normal alternatives, as in 
31.39, we may also ask whether there are any alternatives for which any particular 
rank test is optimum among rank tests. We do not pursue this subject here, because 
the inquiry would be artificial from our present viewpoint (cf. 31.17), which essentially 
regards distribution-free procedures as perfectly robust substitutes for the standard 
normal-theory procedures. Our interest is therefore confined to comparisons of 
efficiency between distribution-free and standard normal-theory methods. An account 
of rank tests in general is given by Lehmann (1959) and by Fraser (1957). 


31.41 Before leaving tests of randomness, we should mention that a variety of
such tests have been proposed in the literature, none of which is as efficient against
normal alternatives as those we have discussed. However, some of them are considerably
simpler to compute than $r_s$ or $t$, and very little less efficient. They are
discussed in Exercises 31.10-12. Other tests have their ARE evaluated by Stuart
(1954b, 1956).


Two-sample tests 


31.42 We now consider problem (1) of 31.13. Given independent random
samples of sizes $n_1$, $n_2$ respectively from continuous distribution functions $F_1(x)$,
$F_2(x)$, we wish to test the hypothesis

$$H_0 : F_1(x) = F_2(x), \quad \text{all } x. \tag{31.74}$$

As we remarked in 31.14, this is equivalent to testing the independence of the variable $x$
and a dummy variable $y$, dichotomized so that only two distinct values of $y$ arise. There
are $n_1 + n_2 = n$ observations on the pair $(x, y)$.

Let us for a moment consider the $n$ values of $x$ as being arranged over the $n$ positions
labelled

$$1, 2, 3, \dots, n_1, n_1 + 1, \dots, n. \tag{31.75}$$

Under $H_0$, each of the $n!$ possible orderings of the $x$-values is equiprobable; but
irrespective of whether $H_0$ holds, the $n_1!$ permutations of the positions in the first
sample, and the $n_2!$ permutations of the positions in the second sample, do not affect
the allocation of the $n$ values to the two samples. Thus there are $n!/(n_1!\,n_2!)$ distinct
allocations to the two samples, corresponding to the $\binom{n}{n_1}$ ways of selecting the
members of the first sample from the $n$ values.


31.43 For the hypothesis (31.74), unlike the others we have so far considered at
(31.9) and (31.49), we may consider a class of alternatives much more general than
those of standard normal theory, namely

$$H_1 : F_2(x) = F_1(x - \theta), \quad \text{all } x. \tag{31.76}$$

(31.76) states that the only difference between the two parent distributions is one of
location. In terms of (31.76), (31.74) becomes

$$H_0 : \theta = 0. \tag{31.77}$$

We shall refer to (31.76) as the location-shift alternative hypothesis. It should be
noted that although a location parameter $\theta$ occurs in (31.76), the hypothesis (31.77)
is non-parametric by our definition of 22.3, since the form of the parent distribution
$F_1(x)$ is unspecified.

31.44 To suggest a statistic for testing $H_0$, we return to the case of normal alternatives.
Consider two normal distributions differing only in location. Without loss
of generality, we assume their common variance $\sigma^2$ to be equal to 1, and that the mean
of the first distribution is zero. The Likelihood Function is therefore

$$L(x \mid H_1) = (2\pi)^{-n/2} \exp\Big\{-\tfrac{1}{2}\sum_{i=1}^{n_1} x_{1i}^2 - \tfrac{1}{2}\sum_{i=1}^{n_2} (x_{2i} - \theta)^2\Big\} = (2\pi)^{-n/2} \exp\Big\{-\tfrac{1}{2}\sum_{i=1}^{n} x_i^2 + \theta \sum_{i=1}^{n_2} x_{2i} - \tfrac{1}{2}n_2\theta^2\Big\}. \tag{31.78}$$

From (31.78) we see that for $\theta > 0$, $L(x \mid H_1)$ will be maximized when $\sum x_{2i}$ is as
large as possible and similarly for $\theta < 0$ when $\sum x_{2i}$ is as small as possible. By the
Neyman-Pearson lemma of 22.10, the most powerful critical region will consist of
those of the $\binom{n}{n_1}$ equiprobable points in the sample space which maximize $L(x \mid H_1)$.
We are thus led to use the statistic $\sum_{i=1}^{n_2} x_{2i}$, or equivalently the mean of the second sample,
$\bar{x}_2 = \frac{1}{n_2}\sum x_{2i}$. Since $n_1\bar{x}_1 + n_2\bar{x}_2 = n\bar{x}$, and the overall mean $\bar{x}$ is invariant under
permutations, $\bar{x}_2$ determines the value of $\bar{x}_1$ also, and we may equivalently consider $\bar{x}_1$
or $\bar{x}_1 - \bar{x}_2$. For the two-sided alternative $\theta \ne 0$, we are inclined to use the “equal-tails”
two-sided test on $\bar{x}_1 - \bar{x}_2$ or equivalently a one-sided test on $(\bar{x}_1 - \bar{x}_2)^2$, large
values forming the critical region. It was in this form that the test statistic was first
investigated by Pitman (1937a).


The permutation distribution of w 
31.45 The statistic $(\bar{x}_1 - \bar{x}_2)^2$ can take values ranging between zero and its maximum
value, which occurs when every member of the first sample is equal to $\bar{x}_1$, and
every member of the second sample equals $\bar{x}_2$. We then have for the observed variance
of the combined samples, say $s^2$, which is invariant under permutations,

$$ns^2 = n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2 = \frac{n_1 n_2}{n}(\bar{x}_1 - \bar{x}_2)^2.$$

We thus have

$$0 \le (\bar{x}_1 - \bar{x}_2)^2 \le \frac{n^2 s^2}{n_1 n_2}.$$

If we therefore define

$$w = \frac{n_1 n_2}{n^2 s^2}(\bar{x}_1 - \bar{x}_2)^2, \tag{31.79}$$

we have for all possible samples

$$0 \le w \le 1. \tag{31.80}$$


31.46 To obtain the permutation distribution of $w$ given $H_0$, we write it identically
as

$$w = \frac{n_1}{n_2} \cdot \frac{(\bar{x}_1 - \bar{x})^2}{s^2}, \tag{31.81}$$

a form in which only $\bar{x}_1$ varies under permutation of the observations. The exact
distribution of $\bar{x}_1$ may be tabulated by enumeration, but as previously remarked in
31.19 the process becomes tedious as $n$ increases. In the form (31.81), however, we
may use already-developed results to obtain the moments of $w$, for it is a multiple of
the squared deviation of the sample mean from the population mean in sampling $n_1$
members from a finite population of $n$ members. We found the necessary expectations
at (12.114) and (12.120), which we rewrite in our present notation as

$$E\{(\bar{x}_1 - \bar{x})^2\} = \frac{n_2 s^2}{n_1(n-1)},$$
$$E\{(\bar{x}_1 - \bar{x})^4\} = \frac{n_2}{n_1^3 (n-1)(n-2)(n-3)} \times \{3n(n_1-1)(n_2-1)s^4 + [n(n+1) - 6n_1 n_2]\,m_4\}, \tag{31.82}$$

where $m_4$ is the observed fourth moment of the combined samples. Thus

$$E(w) = 1/(n-1), \tag{31.83}$$

$$E(w^2) = \frac{1}{n_1 n_2 (n-1)(n-2)(n-3)} \times \{3n_1 n_2(n-6) + 6n + [n(n+1) - 6n_1 n_2]\,g_2\}, \tag{31.84}$$

where $g_2$ is the measure of kurtosis $(m_4/s^4) - 3$. When either $n_1$ or $n_2$ becomes large,
and $n$ with it, (31.84) is asymptotically

$$E(w^2) \sim \frac{3}{n^2}\left\{1 + O\!\left(\frac{g_2}{n_1}\right)\right\}, \tag{31.85}$$

where $n_1$ is the sample size which is not large. If both $n_1$ and $n_2 \to \infty$, (31.84) is

$$E(w^2) \sim \frac{3}{n^2}\left\{1 + O\!\left(\frac{g_2}{n}\right)\right\}. \tag{31.86}$$

Thus, especially when $g_2$ is small, we have

$$E(w^2) \doteq \frac{3}{n^2 - 1}. \tag{31.87}$$
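
For small samples the permutation distribution of $w$ can be enumerated directly, which also gives a check on (31.80) and (31.83). The following Python sketch does this for two made-up samples; the function names and the data are illustrative assumptions only.

```python
from itertools import combinations
from statistics import fmean, pvariance


def w_statistic(first, second):
    """w of (31.79); s^2 is the variance of the combined samples with divisor n."""
    n1, n2 = len(first), len(second)
    n = n1 + n2
    s2 = pvariance(first + second)
    return n1 * n2 * (fmean(first) - fmean(second)) ** 2 / (n * n * s2)


def permutation_distribution(first, second):
    """Values of w over all C(n, n1) allocations of the pooled values."""
    pooled = first + second
    n1 = len(first)
    values = []
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        values.append(w_statistic(a, b))
    return values


x1 = [12.1, 9.8, 11.4, 10.9]            # made-up first sample
x2 = [8.7, 9.1, 10.2, 7.9, 8.4]         # made-up second sample
dist = permutation_distribution(x1, x2)
w_obs = w_statistic(x1, x2)
# Proportion of allocations with w at least as large as the observed value.
p_value = sum(v >= w_obs - 1e-12 for v in dist) / len(dist)
print(len(dist), fmean(dist), 1 / (len(x1) + len(x2) - 1), p_value)
```

The mean of the enumerated values agrees with $1/(n-1)$ whatever the data, and the attainable test sizes are multiples of the reciprocal of the number of allocations.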


31.47 (31.83) and (31.87) are the first two moments about the origin of the Beta
distribution of the first kind

$$dF = \frac{1}{B\{\tfrac{1}{2}, \tfrac{1}{2}(n-2)\}}\, w^{-1/2}(1-w)^{\frac{1}{2}n - 2}\,dw, \quad 0 \le w \le 1, \tag{31.88}$$

which we may therefore expect to approximate the permutation distribution of $w$.
In fact, Pitman (1937a, b) showed that the third moments also agree closely, and that
the approximation is very good.

Now consider the ordinary “Student's” $t^2$-statistic for testing the difference in
location between two normal distributions. In our present notation, we write it

$$t^2 = (n-2)\,\frac{\dfrac{n_1 n_2}{n}(\bar{x}_1 - \bar{x}_2)^2}{n_1 s_1^2 + n_2 s_2^2}, \tag{31.89}$$

where $s_1^2$, $s_2^2$ are the separate sample variances. Using the identity

$$ns^2 = n_1 s_1^2 + n_2 s_2^2 + \frac{n_1 n_2}{n}(\bar{x}_1 - \bar{x}_2)^2$$

in (31.79) and (31.89) shows that

$$w = \frac{t^2/(n-2)}{1 + t^2/(n-2)} \tag{31.90}$$

exactly. Thus we have been dealing with a monotone increasing function of $t^2$. What
is more, in the exact normal theory, the transformation (31.90) applied to the
“Student's” distribution with $\nu = n-2$ gives precisely the distribution (31.88). (In
fact, we carried out essentially this transformation in reducing “Student's” distribution
function to the Incomplete Beta function in 16.11, except that there we transformed
to $(1-w)$ and obtained (31.88) with $1-w$ replacing $w$.)

We have therefore found, exactly as in 31.19, that the approximation to the permutation
distribution in testing a non-parametric hypothesis is precisely the normal-theory
distribution. In this particular case, we may test $w$, from (31.90), by putting
$(n-2)w/(1-w) = t^2$ with $(n-2)$ degrees of freedom. For the one-sided tests discussed
at the outset, we simply use $t$ in the appropriate tail rather than $t^2$.
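
The exact relation (31.90) between $w$ and $t^2$ is easily confirmed on any pair of samples. The sketch below computes both statistics from their definitions (31.79) and (31.89), taking the separate and combined variances with divisors $n_1$, $n_2$ and $n$ as in the text; the function name and the data are hypothetical.

```python
from statistics import fmean, pvariance


def w_and_t2(first, second):
    """Return (w, t^2) computed from (31.79) and (31.89) respectively."""
    n1, n2 = len(first), len(second)
    n = n1 + n2
    diff2 = (fmean(first) - fmean(second)) ** 2
    s2 = pvariance(first + second)                 # combined variance, divisor n
    within = n1 * pvariance(first) + n2 * pvariance(second)
    w = n1 * n2 * diff2 / (n * n * s2)
    t2 = (n - 2) * (n1 * n2 / n) * diff2 / within
    return w, t2


x1 = [12.1, 9.8, 11.4, 10.9]            # made-up data
x2 = [8.7, 9.1, 10.2, 7.9, 8.4]
n = len(x1) + len(x2)
w, t2 = w_and_t2(x1, x2)
print(w, (t2 / (n - 2)) / (1 + t2 / (n - 2)))      # the two values should agree
print(t2, (n - 2) * w / (1 - w))                   # and so should these
```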
31.48 The wide applicability of $t^2$ as an approximation of the normal theory in
this instance must clearly be attributed to the operation of Central Limit effects, since
we are dealing with a difference between means; cf. the related discussion of robustness
in 31.4.

Just as we remarked in 31.30 previously, the asymptotic equivalence of the permutation
distribution to that of the optimum normal theory test implies that the former has
ARE of 1 against normal alternatives, in this case of a location-shift.


Distribution-free confidence intervals for a shift in location 

31.49 We may now use the test statistic $w$ of (31.79) to obtain distribution-free
confidence intervals for the location-shift $\theta$ in (31.76). For, whatever the value of $\theta$,
(31.76) implies that the $n_1$ values $x_{1i}$ $(i = 1, 2, \dots, n_1)$ and the $n_2$ values $(x_{2i} + \theta)$
$(i = 1, 2, \dots, n_2)$ are two samples which come from the same distribution $F_1(x)$.
The distribution of $w$ given $H_0$ is therefore applicable to these two samples.

Let us denote by $w(\theta)$ the calculated value of $w$ for the two samples, which is
evidently a function of $\theta$. Let $w_\alpha$ be the upper critical value of $w$ for a test of size $\alpha$, i.e.

$$P\{w(\theta) \le w_\alpha\} = 1 - \alpha. \tag{31.91}$$

Using (31.90), (31.91) is equivalent to

$$P\{t^2(\theta) \le t_\alpha^2\} = 1 - \alpha, \tag{31.92}$$

where $t^2$ is defined at (31.89). The denominator of $t^2$ is a function of the separate
sample variances only, and is therefore not a function of $\theta$. Using (31.89) in (31.92),
we therefore have

$$P\{[\bar{x}_1 - (\bar{x}_2 + \theta)]^2 \le k^2\} = 1 - \alpha, \tag{31.93}$$

where

$$k^2 = \frac{n(n_1 s_1^2 + n_2 s_2^2)}{n_1 n_2 (n-2)}\, t_\alpha^2. \tag{31.94}$$

Thus from (31.93) we have, whatever the true value of $\theta$,

$$P\{(\bar{x}_1 - \bar{x}_2) - k \le \theta \le (\bar{x}_1 - \bar{x}_2) + k\} = 1 - \alpha, \tag{31.95}$$

and (31.95) is a confidence interval for $\theta$.

If the sample sizes are large enough for the permutation distribution of $t^2$ to be
closely approximated by the exact “Student's” distribution, we obtain $t_\alpha^2$ from the
tables of the latter; otherwise, the exact permutation distribution of $w$ must be used,
with (31.90). We are then, of course, limited in the values of $\alpha$ we may choose for our
test or confidence interval to multiples of $\binom{n}{n_1}^{-1}$.
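
A minimal sketch of the interval (31.95) follows, using the sign convention of (31.93), in which $\theta$ is the amount added to the second sample. The critical value `t_alpha` is supplied by the caller from tables of “Student's” distribution with $n-2$ degrees of freedom; the value 2.365 used below is the two-sided 5 per cent point for 7 degrees of freedom. Function names and data are hypothetical.

```python
from math import sqrt
from statistics import fmean, pvariance


def t_squared(first, second):
    """t^2 of (31.89), with sample variances taken with divisors n1 and n2."""
    n1, n2 = len(first), len(second)
    n = n1 + n2
    within = n1 * pvariance(first) + n2 * pvariance(second)
    return (n - 2) * (n1 * n2 / n) * (fmean(first) - fmean(second)) ** 2 / within


def shift_interval(first, second, t_alpha):
    """Limits of (31.95): (mean1 - mean2) -/+ k, with k^2 given by (31.94)."""
    n1, n2 = len(first), len(second)
    n = n1 + n2
    within = n1 * pvariance(first) + n2 * pvariance(second)
    k = sqrt(n * within * t_alpha ** 2 / (n1 * n2 * (n - 2)))
    centre = fmean(first) - fmean(second)
    return centre - k, centre + k


x1 = [12.1, 9.8, 11.4, 10.9]            # made-up data
x2 = [8.7, 9.1, 10.2, 7.9, 8.4]
t_alpha = 2.365                          # assumed tabulated value, 7 d.f., 5 per cent two-sided
low, high = shift_interval(x1, x2, t_alpha)
print(low, high)
```

At either limit, adding the limiting value of $\theta$ to every member of the second sample brings $t^2(\theta)$ exactly to $t_\alpha^2$, which is the defining property of the interval.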


Consistency of the w-test 
31.50 Using the result of the last section, we may easily see that $w$ is a consistent
test statistic for the hypothesis (31.77) against the alternative (31.76) with $\theta \ne 0$, provided
that $F_1(x)$ has a finite variance. In fact, even if $F_1(x)$ and $F_2(x)$ have different
finite variances and different means, $w$ remains consistent, as Pitman (1948) showed.
Consider $k^2$, defined at (31.94). If $n_1, n_2 \to \infty$ so that $n_1/n_2 \to \lambda$, $0 < \lambda < \infty$,
$k^2$ converges in probability to $(\lambda\sigma_1^2 + \sigma_2^2)\,t_\alpha^2/n_1$, which is of order $n_1^{-1}$, while by the Law
of Large Numbers, $(\bar{x}_1 - \bar{x}_2)$ converges to $E(x_1) - E(x_2)$, which is the true value of $\theta$,
say $\theta_0$. (31.95) now shows that for any $\alpha$ the confidence interval

$$I = (\bar{x}_1 - \bar{x}_2 - k,\ \bar{x}_1 - \bar{x}_2 + k)$$

is an interval, with length of order $n_1^{-1/2}$, for $\theta$. If we choose $\alpha < \varepsilon$, we see that for
any $\theta_1 \ne \theta_0$,

$$\lim_{n_1 \to \infty} P\{\theta_1 \in I\} < \varepsilon \tag{31.96}$$

and this is merely a translation into confidence intervals terminology of the consistency
statement to be proved, for $\varepsilon$ may be arbitrarily near zero. Ultimately, as $n_1$ increases,
the interval will exclude $\theta_1$ with probability tending to 1, i.e. the test of $\theta = \theta_0$ will
reject $\theta_1 \ne \theta_0$. This argument also makes it clear that (31.77) may be replaced by
$H_0 : \theta = \theta_0$, if we add an increment $\theta_0$ to each observation in the second sample.


Rank tests for the two-sample problem 

31.51 Just as in 31.21 during our discussion of tests of independence, so here in
the two-sample problem we see that if we wish to be in a position to tabulate the exact
permutation distribution of a test statistic for any $n$, we must remove the dependence
of the test statistic upon the actual values of the observations, which are random variables,
and we are led to the use of rank tests, which are particularly appropriate because
of their invariance under monotone transformation of the underlying variables, which
leave the hypothesis (31.74) invariant. Once again, the simplest procedure is simply
to replace the observations $x_i$ by their rank values, i.e. to rank the $n_1 + n_2 = n$ observations
in a single sequence and replace the value $x_i$ by its rank $X_i$. We then have
a set of $n$ values $X_i$ which are a permutation of the first $n$ natural numbers, of which
$n_1$ belong to the first sample and $n_2$ to the second.

Since, as we pointed out in 31.44, the statistic $w$ is equivalent to using the mean $\bar{x}_1$
of the first sample, the rank test obtained from $w$ by replacing the observations by
their ranks is equivalent to using

$$S = \sum_{i=1}^{n_1} X_{1i}, \tag{31.97}$$

the sum of the ranks in the first sample, which is analogous to $r_s$ of (31.21) since both
arise from replacing observations by ranks.


31.52 Now suppose that we seek an analogue of $t$, defined by (31.23), i.e. essentially
of $Q$ as defined at (31.38). We should obviously expect, if the hypothesis (31.74)
holds, that the observations from the first and second samples would be thoroughly
“mixed up” with no tendency for the ranks in the first sample to cluster at either or
both ends of the range from 1 to $n$. Define a statistic $U$ which counts the number of
times a member of the first sample exceeds a member of the second sample, i.e.

$$U = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} h_{ij}, \tag{31.98}$$

where $h_{ij}$ is defined at (31.36) as before. $U$ ranges in value from 0 to $n_1 n_2$.
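
The two rank statistics just defined are simple to compute. The sketch below evaluates $U$ of (31.98) by direct counting and the rank sum $S$ of (31.97) from the pooled ordering, assuming no ties. It also checks the standard identity $U = S - \tfrac{1}{2}n_1(n_1+1)$, which is quoted here as a known result rather than taken from the passage above. Function names and data are hypothetical.

```python
def wilcoxon_u(first, second):
    """U of (31.98): number of pairs in which a first-sample value exceeds a second-sample value."""
    return sum(1 for a in first for b in second if a > b)


def rank_sum_first(first, second):
    """S of (31.97): sum of the pooled ranks of the first sample (no ties assumed)."""
    pooled = sorted(first + second)
    return sum(pooled.index(a) + 1 for a in first)


x1 = [12.1, 9.8, 11.4, 10.9]            # made-up data
x2 = [8.7, 9.1, 10.2, 7.9, 8.4]
u = wilcoxon_u(x1, x2)
s = rank_sum_first(x1, x2)
n1 = len(x1)
print(u, s, s - n1 * (n1 + 1) // 2)
```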
Whereas in the case of tests of independence there is a genuine choice between
$r_s$ and $t$ as test statistics (although they are equivalent from the viewpoint of ARE, as

Consistency and unbiassedness of the Wilcoxon test

31.56 It follows from the proof of consistency given in 31.50 for the $w$-test, which
reduces to Wilcoxon's test when ranks replace variate-values, that the Wilcoxon test
is consistent against alternatives for which $F_1$ and $F_2$, the underlying parent distributions,
generate different mean ranks in their samples. Clearly, this will happen if and
only if the probability $p$ of an observation from $F_1$ exceeding one from $F_2$ differs
from $\tfrac{1}{2}$. This result, given by Pitman (1948) and independently by later writers, may
also be shown directly as indicated in Exercise 31.16.


31.57 If we consider the one-sided alternative hypothesis that the second sample
comes from a “stochastically larger” distribution, i.e.

$$H_1 : F_2(x) \le F_1(x), \quad \text{all } x, \tag{31.108}$$

it is a simple matter to prove that both Pitman's and Wilcoxon's one-sided tests are
unbiassed against (31.108). In fact, Lehmann (1951) showed the unbiassedness of
any similar critical region for (31.74) against (31.108) which satisfies the intuitively
desirable condition (C) that if any member of the second sample is increased in value,
the sample point remains in the critical region. For any pair $F_1, F_2$, let us define
a function $h(x)$ by the equation

$$F_2(h(x)) = F_1(x) \tag{31.109}$$

so that, from (31.108),

$$h(x) \ge x. \tag{31.110}$$

Now consider the two samples with the $n_2$ members of the second sample transformed
from $x_i$ to $h(x_i)$ and the first sample unchanged. We see from (31.109) that the hypothesis
of identical populations holds for the values thus transformed. If a region in
the transformed sample space has content $P$, equation (31.110) and condition (C)
assumed above ensure that for the untransformed sample space its content is $\alpha \le P$.
Since $\alpha$ is the size of the test on the untransformed variables, and $P$ its power against
(31.108), we see that the test is unbiassed.

The condition (C) is obviously satisfied by the one-sided Pitman and Wilcoxon
tests.


31.58 If we now consider the general two-sided alternative hypothesis

$$H_1 : F_1(x) \ne F_2(x), \tag{31.111}$$

or even the more limited location-shift alternative (31.76) with $\theta$ unrestricted in sign,
the Wilcoxon test is no longer unbiassed in general. For location-shift alternatives,
Van der Vaart (1950, 1953) showed that if $n_1 = n_2$ or the common frequency function
is symmetric about some value (not necessarily 0), the first derivative of the power
function at $\theta = 0$ is zero if it exists, but that even then the test need not be unbiassed.


The ARE of the Wilcoxon test 


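31.59 We now confine ourselves to the location-shift situation (31.76), and find
the ARE of Wilcoxon's test compared to “Student's” $t$-test (which is the optimum
test of location-shift when $F_1$ is a normal distribution) when $F_1$ is an arbitrary continuous
d.f. with finite variance $\sigma^2$.

“Student's” $t$ defined at (31.89) is known to be asymptotically equivalent to
using the statistic $(\bar{x}_1 - \bar{x}_2)$, which, whatever the form of $F_1$, tends to normality with
mean $-\theta$ and variance $\sigma^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right) = \dfrac{n\sigma^2}{n_1 n_2}$. Thus we have

$$\frac{\left\{\left[\dfrac{\partial E(\bar{x}_1 - \bar{x}_2)}{\partial\theta}\right]_{\theta=0}\right\}^2}{\operatorname{var}\{(\bar{x}_1 - \bar{x}_2) \mid \theta = 0\}} = \frac{n_1 n_2}{n\sigma^2}. \tag{31.112}$$

We now have to evaluate the equivalent of (31.112) for the Wilcoxon statistic $U$.
From the definition of $U$ at (31.98),

$$E(U) = n_1 n_2 E(h_{ij}) = n_1 n_2\,p,$$

where $p$ is the probability that an observation $x_1$ from the first distribution, $F_1(x)$,
exceeds one $x_2$ from the second distribution, $F_1(x - \theta)$. This is the probability that
$x_2 - x_1 < 0$. Using the formula (11.68) for the d.f. of the sum of two random variables
(with $-y$ here being replaced by $y$ in the argument of $F$, and suffixes 1, 2 interchanged
to give the d.f. of $x_2 - x_1$), we find

$$p = H(0) = \int_{-\infty}^{\infty} F_1(y - \theta)\,f_1(y)\,dy,$$

whence

$$\frac{\partial p}{\partial\theta} = -\int_{-\infty}^{\infty} f_1(x - \theta)\,f_1(x)\,dx$$

and

$$\left[\frac{\partial E(U)}{\partial\theta}\right]_{\theta=0} = -n_1 n_2 \int_{-\infty}^{\infty} \{f_1(x)\}^2\,dx. \tag{31.113}$$

(31.113) and (31.105) give

$$\frac{\left\{\left[\dfrac{\partial E(U)}{\partial\theta}\right]_{\theta=0}\right\}^2}{\operatorname{var}(U \mid \theta = 0)} = \frac{12\,n_1 n_2}{n+1}\left\{\int_{-\infty}^{\infty} \{f_1(x)\}^2\,dx\right\}^2. \tag{31.114}$$

Using (25.27), (31.112) and (31.114), we have for the ARE of Wilcoxon's $U$ test compared
to “Student's” $t$,

$$A_{U,t} = 12\sigma^2\left\{\int_{-\infty}^{\infty} \{f_1(x)\}^2\,dx\right\}^2, \tag{31.115}$$

a result due to Pitman (1948).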


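31.60 To evaluate (31.115), we only require the value of the integral

\[ \int_{-\infty}^{\infty} f_1^2(x)\,dx = E\{f_1(x)\}. \tag{31.116} \]

(31.116) is easily evaluated for particular distributions. When $F_1$ is normal, we have

\[ \begin{aligned}
E\{f_1(x)\} &= (2\pi\sigma^2)^{-1}\int_{-\infty}^{\infty}\left[\exp\left\{-\tfrac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right\}\right]^2 dx \\
&= (2\pi\sigma^2)^{-1}\int_{-\infty}^{\infty}\exp\left\{-\frac{(x-\mu)^2}{\sigma^2}\right\} dx \\
&= (2\pi\sigma^2)^{-1}\left(2\pi\cdot\tfrac{1}{2}\sigma^2\right)^{1/2} = (4\pi\sigma^2)^{-1/2}.
\end{aligned} \]
Thus, from (31.115) in the normal case
\[ A_{U,t} = 3/\pi \doteq 0.95. \tag{31.117} \]

The result is scale-free, as it always is since $\sigma E\{f_1(x)\}$ is scale-free. We may thus
always re-scale in (31.115) if this is convenient.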
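
As a numerical check on (31.115) and (31.117), the following Python sketch evaluates $12\sigma^2\{\int f_1^2\,dx\}^2$ by the trapezoidal rule for a normal parent with two different scales. It is an illustration only: the names `integrate` and `are_wilcoxon_vs_t`, the integration limits and the step count are assumptions made here, not part of the derivation above.

```python
import math

def integrate(g, a, b, steps=20000):
    """Composite trapezoidal rule for g on [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (g(a) + g(b))
    for i in range(1, steps):
        total += g(a + i * h)
    return total * h

def are_wilcoxon_vs_t(f, a, b, variance):
    """ARE of (31.115): 12 * sigma^2 * (integral of f^2)^2.

    f is the parent frequency function, [a, b] a range carrying
    essentially all of its mass (an assumption of this sketch).
    """
    int_f2 = integrate(lambda x: f(x) ** 2, a, b)
    return 12.0 * variance * int_f2 ** 2

def normal_pdf(x, sigma=1.0):
    """Normal frequency function with mean 0 and standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# The same value is obtained whatever the scale, as remarked in the text.
are_sigma1 = are_wilcoxon_vs_t(normal_pdf, -12.0, 12.0, 1.0)
are_sigma3 = are_wilcoxon_vs_t(lambda x: normal_pdf(x, 3.0), -36.0, 36.0, 9.0)
print(are_sigma1, are_sigma3, 3.0 / math.pi)
```

Both computed values should agree with $3/\pi$ to within the integration error, which illustrates the scale-free property.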


31.61 It is easy to see that (31.115) can take infinite values (cf. Exercise 31.18).
We now inquire whether there is a non-zero minimum below which it cannot fall.
We wish to minimize $E\{f(x)\}$ for fixed $\sigma^2$, which we may take equal to unity. We
take $E(x) = 0$ without loss of generality. We thus require to minimize
$\int_{-\infty}^{\infty} f^2(x)\,dx$ subject to the conditions $\int_{-\infty}^{\infty} f(x)\,dx = 1$, $\int_{-\infty}^{\infty} x^2 f(x)\,dx = 1$. Using
$\lambda$, $\mu$ as undetermined multipliers, this is equivalent to minimizing the integral

\[ \int_{-\infty}^{\infty}\{f^2(x) - 2\lambda(\mu^2-x^2) f(x)\}\,dx. \tag{31.118} \]
Since $f(x)$ is non-negative, (31.118) is minimized for choice of $f(x)$ when
\[ f(x) = \begin{cases} \lambda(\mu^2-x^2), & x^2 \le \mu^2, \\ 0, & x^2 > \mu^2, \end{cases} \tag{31.119} \]

a simple parabolic distribution. If $\lambda$ and $\mu$ are found from the conditions

\[ \int_{-\infty}^{\infty} f(x)\,dx = \int_{-\infty}^{\infty} x^2 f(x)\,dx = 1, \]

we find
\[ \mu^2 = 5, \qquad \lambda = 3/(20\sqrt{5}), \tag{31.120} \]
whence
\[ \int_{-\infty}^{\infty} f^2(x)\,dx = \tfrac{16}{15}\lambda^2\mu^5 = \frac{3}{5\sqrt{5}}. \tag{31.121} \]
Thus, from (31.121) and (31.115), the ARE for the distribution (31.119-120) is
\[ \inf A_{U,t} = 108/125 = 0.864. \tag{31.122} \]
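
The constants in (31.120) to (31.122) can be checked numerically. The sketch below integrates the parabolic frequency function (31.119) with $\mu^2 = 5$, $\lambda = 3/(20\sqrt5)$ and confirms that it has unit mass, unit variance and ARE 108/125. The helper names and the trapezoidal step count are assumptions of this sketch.

```python
import math

MU2 = 5.0                                  # mu^2 from (31.120)
MU = math.sqrt(MU2)
LAM = 3.0 / (20.0 * math.sqrt(5.0))        # lambda from (31.120)

def parabolic_pdf(x):
    """Frequency function (31.119): lambda * (mu^2 - x^2) on |x| <= mu, else 0."""
    return LAM * (MU2 - x * x) if x * x <= MU2 else 0.0

def integrate(g, a, b, steps=20000):
    """Composite trapezoidal rule for g on [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (g(a) + g(b))
    for i in range(1, steps):
        total += g(a + i * h)
    return total * h

mass = integrate(parabolic_pdf, -MU, MU)
variance = integrate(lambda x: x * x * parabolic_pdf(x), -MU, MU)
int_f2 = integrate(lambda x: parabolic_pdf(x) ** 2, -MU, MU)
are = 12.0 * variance * int_f2 ** 2        # (31.115) with sigma^2 = variance

print(mass, variance, int_f2, are)
```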


31.62 The high value (31.122) for the minimum ARE of Wilcoxon’s test
compared to “Student’s” t, which was first obtained by Hodges and Lehmann (1956), is
very reassuring. In practical terms it means that in large samples we cannot lose more
than 13.6 per cent efficiency in using Wilcoxon’s rather than “Student’s” test for
a location shift; on the other hand, we may gain a very great deal (cf. Exercise 31.18,
where it should be confirmed that for a Gamma distribution with parameter p = 1,
$A_{U,t} = 3$). If the distribution is actually normal, the loss of efficiency is only about
5 per cent, by (31.117).

Van der Vaart (1950) showed that in the normal case the derivatives of the power
function differ very little for very small sample sizes from their asymptotic values
relative to those for the “Student’s” t-test. Sundrum (1953) computed approximate
power functions in the normal and rectangular cases for $n_1 = n_2 = 10$ which bear out,
particularly in the normal case, the small power loss involved in using Wilcoxon’s
rather than “Student’s” test. Dixon (1954) and Hodges and Lehmann (1956) give
some small-sample results for the normal case which confirm this. Witting (1960), 
by using an Edgeworth expansion to order -®, shows that the value (31.117) for the 
ARE in the normal case holds very closely for sample sizes ranging from 4 to 40. 

Chanda (1963) finds high ARE even for discrete populations. Noether (1967) 
shows that the test and the confidence intervals based upon it are conservative for 
discrete distributions. Haynam and Govindarajulu (1966) give exact power com- 
putations in the exponential and the rectangular cases. 


A test with uniformly better ARE than “Student’s” test

31.63 Although Wilcoxon’s test performs very well compared to “Student’s”
test, as we have seen in 31.59-62, we can do even better. Reverting to our discussion
of 31.39, we may obviously obtain ARE of 1 against normal location-shift alternatives
by using the test statistic w (or equivalently the mean of the first sample, $\bar{x}_1$) with the
observations replaced by the expected values of the order statistics in normal samples,
which we denote by $E(s,n)$ as in 31.39. The test statistic is therefore equivalent to

\[ c_1 = \frac{1}{n_1}\sum_{i=1}^{n_1} E(X_i, n), \tag{31.123} \]

where $X_i$ is the rank among the n observations of the ith observation in the first sample.
(31.123) is usually called the normal scores or $c_1$ test statistic. If we define
\[ z_s = \begin{cases} 1 & \text{if the sth observation is from the first sample,} \\ 0 & \text{otherwise,} \end{cases} \]

we may rewrite (31.123) as

\[ c_1 = \frac{1}{n_1}\sum_{s=1}^{n} E(s,n)\,z_s. \tag{31.124} \]

Hoeffding (1952) first demonstrated that the $c_1$-test has ARE 1 against normal
alternatives. Terry (1952) gives its exact distribution for $n_1+n_2 \le 10$, and Klotz (1964)
gives critical values for $n_1+n_2 \le 20$, when (31.74) holds. The asymptotic normality
of $c_1$ (and a wide class of similar statistics), whatever the parent distributions $F_1$, $F_2$ when
$\lim n_1/n_2$ is bounded away from 0 and $\infty$, was demonstrated by Chernoff and Savage
(1958). Clearly
\[ E(c_1\mid H_0) = \frac{1}{n_1}\sum_{s=1}^{n} E(s,n)\,E(z_s) = \frac{1}{n_1}\,\frac{n_1}{n}\sum_{s=1}^{n} E(s,n) = 0, \]
while $\operatorname{var}(c_1\mid H_0)$ is given at (31.133) below.


31.64 An alternative definition of $c_1$ is more convenient for the purpose of
calculating its ARE. Define
\[ H(x) = \frac{n_1}{n}F_1(x) + \frac{n_2}{n}F_2(x), \tag{31.125} \]
the distribution function of the combined parent populations, and its sample analogue
\[ H_n(x) = \frac{n_1}{n}S_{n_1}(x) + \frac{n_2}{n}S_{n_2}(x), \tag{31.126} \]

where $S_{n_1}$, $S_{n_2}$ are the sample distribution functions as defined at (30.97). If we also
define the function $J_n(x)$ by

\[ E(s,n) = J_n\!\left(\frac{s}{n}\right), \tag{31.127} \]
then $c_1$ may be defined in the integral form
\[ c_1 = \int_{-\infty}^{\infty} J_n\{H_n(x)\}\,dS_{n_1}(x). \tag{31.128} \]

As $n \to \infty$ we have, from (31.125-126), $H_n(x) \to H(x)$, while from (31.127), under
mild conditions (cf. (31.73)), $J_n(x) \to \Phi^{-1}(x) = J(x)$, where $\Phi$ is the standardized
normal distribution function. Thus, as $n \to \infty$ we have from (31.128), under
regularity conditions,

\[ \lim E(c_1) = \int_{-\infty}^{\infty} J\{H(x)\}\,dF_1(x). \tag{31.129} \]

Van der Waerden (1952, 1953) proposed a two-sample test based on the inverse
normal d.f. transformation of the sample values, which is asymptotically equivalent to
the $c_1$ test (cf. (31.129)).


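31.65 If we now consider location-shift alternatives (31.76), and differentiate
(31.129) with respect to the location-shift $\theta$, we have

\[ \frac{\partial}{\partial\theta}\lim E(c_1) = \int_{-\infty}^{\infty} J'\{H(x)\}\left\{\frac{\partial H(x)}{\partial\theta}\right\} dF_1(x) \tag{31.130} \]
and since
\[ H(x) = \frac{n_1}{n}F_1(x) + \frac{n_2}{n}F_1(x-\theta) \]
we have
\[ \frac{\partial H(x)}{\partial\theta} = -\frac{n_2}{n} f_1(x-\theta). \]
Putting this into (31.130), we find
\[ \frac{\partial}{\partial\theta}\lim E(c_1) = -\frac{n_2}{n}\int_{-\infty}^{\infty} J'\{H(x)\}\,f_1(x-\theta)\,dF_1(x). \tag{31.131} \]
Now when $\theta = 0$, $H(x) = F_1(x)$, so (31.131) gives
\[ \left[\frac{\partial}{\partial\theta}\lim E(c_1)\right]_{\theta=0} = -\frac{n_2}{n}\int_{-\infty}^{\infty} J'\{F_1(x)\}\,f_1^2(x)\,dx. \tag{31.132} \]

When $H_0$ holds, the variance of $c_1$ is simply, from (31.124),

\[ \operatorname{var}(c_1\mid H_0) = \frac{1}{n_1^2}\operatorname{var}\left\{\sum_{s=1}^{n} z_s E(s,n)\right\}. \]

Here, only the $z_s$ are random variables; in fact, they are identical 0-1 variables with
probability $n_1/n$ of taking the value 1, but they are not independent, any pair having
correlation $-1/(n-1)$ because of the symmetry. Thus

\[ \begin{aligned}
\operatorname{var}(c_1\mid H_0) &= \frac{1}{n_1^2}\left[\sum_{s=1}^{n}\{E(s,n)\}^2\operatorname{var} z_s + \mathop{\sum\sum}_{s\neq r} E(s,n)E(r,n)\operatorname{cov}(z_s,z_r)\right] \\
&= \frac{\operatorname{var} z}{n_1^2}\left[\sum_{s=1}^{n}\{E(s,n)\}^2 - \frac{1}{n-1}\left\{\left(\sum_{s=1}^{n}E(s,n)\right)^2 - \sum_{s=1}^{n}\{E(s,n)\}^2\right\}\right]
\end{aligned} \]

and since $\sum_{s=1}^{n} E(s,n) = 0$ by the symmetry of the normal distribution and
$\operatorname{var} z = \frac{n_1}{n}\left(1-\frac{n_1}{n}\right)$, this reduces to

\[ \operatorname{var}(c_1\mid H_0) = \frac{n_2}{n(n-1)n_1}\sum_{s=1}^{n}\{E(s,n)\}^2. \tag{31.133} \]

This is an exact result. When $n \to \infty$, by (31.73),

\[ \frac{1}{n}\sum_{s=1}^{n}\{E(s,n)\}^2 \to \int_{-\infty}^{\infty} x^2\,d\Phi(x) = 1 \]

and
\[ \operatorname{var}(c_1\mid H_0) \sim \frac{n_2}{n\,n_1}. \tag{31.134} \]

Thus, from (31.132) and (31.134)

\[ \frac{\left\{\left[\dfrac{\partial}{\partial\theta}E(c_1)\right]_{\theta=0}\right\}^2}{\operatorname{var}(c_1\mid H_0)} \sim \frac{n_1 n_2}{n}\left[\int_{-\infty}^{\infty} J'\{F_1(x)\}\,f_1^2(x)\,dx\right]^2. \tag{31.135} \]

Using (25.27), (31.135) and (31.112) give for the ARE of the $c_1$ test compared to
“Student’s” t

\[ A_{c_1,t} = \left[\int_{-\infty}^{\infty} J'\{F_1(x)\}\,f_1^2(x)\,dx\right]^2. \tag{31.136} \]

In (31.136) we have put $\sigma^2 = 1$ by standardizing $F_1(x)$. We now seek the minimum
value of (31.136).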
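
The exact variance (31.133) can be confirmed by brute force for a small case. The sketch below enumerates every equally likely allocation of $n_1$ of the $n$ positions to the first sample and compares the resulting variance of $c_1$ with the formula. The scores used are an arbitrary symmetric set summing to zero, chosen here for convenience; they are not the true $E(s,n)$, and the function names are assumptions of this sketch.

```python
from fractions import Fraction
from itertools import combinations

def exact_variance_by_enumeration(scores, n1):
    """Variance of c1 (mean of the n1 scores falling in sample 1) over all
    C(n, n1) equally likely allocations of positions to the first sample."""
    n = len(scores)
    values = [Fraction(sum(scores[s] for s in chosen), n1)
              for chosen in combinations(range(n), n1)]
    mean = sum(values, Fraction(0)) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_formula(scores, n1):
    """Right-hand side of (31.133); valid when the scores sum to zero."""
    n = len(scores)
    n2 = n - n1
    return Fraction(n2, n * (n - 1) * n1) * sum(Fraction(a) ** 2 for a in scores)

# Hypothetical symmetric scores standing in for E(s, n), n = 5.
scores = [Fraction(-3, 2), Fraction(-1, 2), Fraction(0), Fraction(1, 2), Fraction(3, 2)]
print(exact_variance_by_enumeration(scores, 2), variance_formula(scores, 2))
```

Exact rational arithmetic is used so that the two results can be compared for equality rather than to within rounding error.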


31.66 It will be remembered that $J(x)$ in (31.136) is defined as $\Phi^{-1}(x)$, the inverse
of the standardized normal d.f. Thus

\[ \Phi\{J(x)\} = x. \tag{31.137} \]
Differentiating both sides, and writing $\phi(x) = \Phi'(x)$ for the normal f.f., we have

\[ \frac{d}{dx}\left[\Phi\{J(x)\}\right] = \phi\{J(x)\}\,J'(x) = 1, \]
so that
\[ J'(x) = \frac{1}{\phi\{J(x)\}}. \tag{31.138} \]
Also, from (31.137), $\Phi\{J[F_1(x)]\} = F_1(x)$, so we have the equation in differentials

\[ dF_1(x) = d\Phi\{J[F_1(x)]\} = \phi\{J[F_1(x)]\}\,d\{J[F_1(x)]\}. \tag{31.139} \]

(31.138) gives also

\[ J'\{F_1(x)\} = \frac{1}{\phi\{J[F_1(x)]\}} \]
and the integral in square brackets in (31.136) therefore becomes
\[ \int_{-\infty}^{\infty}\frac{f_1^2(x)}{\phi\{J[F_1(x)]\}}\,dx = \int_{-\infty}^{\infty}\frac{dF_1(x)}{dx}\,dJ = \int_{-\infty}^{\infty}\frac{dJ}{dx}\,\phi(J)\,dJ. \tag{31.140} \]

To minimize (31.136), we require to minimize (31.140) subject to the
standardization conditions

\[ \int x\,dF_1 = \int x\,\phi(J)\,dJ = 0; \qquad \int x^2\,dF_1 = \int x^2\,\phi(J)\,dJ = 1. \tag{31.141} \]

The minimization is with respect to $F_1$, the suppressed argument of J in (31.140).
This is equivalent to minimization with respect to x as a function of J. We therefore
seek a monotone non-decreasing function $x(J)$ which minimizes (31.140) subject to
(31.141).

31.67 Since $\phi$ is the standardized normal f.f., the restrictions (31.141) are obviously
satisfied when
\[ x(J) = J \tag{31.142} \]
and (31.140) is then equal to 1. From (31.139), this occurs when $dF_1(x) = d\Phi(x)$
so that $F_1$ is normal. Thus we have verified our statement of 31.63 that in the normal
case the $c_1$ test has ARE 1.

Chernoff and Savage (1958) show that (31.142) is actually the unique solution of
our minimization problem, i.e. that $A_{c_1,t} > 1$ for any non-normal $F_1$, so that the $c_1$ test
has minimum ARE of 1 compared to the t-test. We present only a heuristic argument
leading to their result. Let us use the representation $x(J) = J + p(J)$.
If (31.141) holds, we have
\[ 1 = \operatorname{var} J = \operatorname{var}\{x(J)\} = \operatorname{var} J + \operatorname{var}\{p(J)\} + 2\operatorname{cov}\{J, p(J)\} \]
and thus
\[ \operatorname{cov}\{J, p(J)\} \le 0, \tag{31.143} \]
the equality in (31.143) being attained if and only if $p(J)$ is a constant. By (31.141)
again, $E\{p(J)\} = 0$, so (31.143) is an equality only when $p(J) \equiv 0$. Let us neglect
this case, which we treated at (31.142). We wish to minimize (31.140), which is
$E\{1/x'(J)\}$. We assume $x(J)$ to be strictly increasing, so that $x'(J) > 0$ and, by
Exercise 9.13, excluding the degenerate equality,

\[ E\left\{\frac{1}{x'(J)}\right\} > \frac{1}{E\{x'(J)\}} = \frac{1}{1 + E\{p'(J)\}}. \]
Now (31.143) implies under certain conditions that $E\{p'(J)\} \le 0$, for if J and $p(J)$
have negative covariance, the “average slope” of $p(J)$ must be negative. Thus, if
$p(J)$ is not identically zero,
\[ E\left\{\frac{1}{x'(J)}\right\} > 1, \]

and hence $x(J) = J$ gives the unique minimum of the ARE.

Mikulski (1963) has shown under regularity conditions that no other distribution than
the normal has this remarkable property that the efficiency of the best rank-order test
compared to the best location-shift test always exceeds unity when the underlying
distribution differs from that assumed in deriving the tests.

D. R. Cox (1964) considers similar uses of the exponential scores derived in Exercise
19.11.
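
The bound of 31.67 can be illustrated numerically by evaluating (31.136) in the form $\left[\int f_1^2(x)/\phi\{J[F_1(x)]\}\,dx\right]^2$ for a standardized parent. The sketch below does this for the normal distribution (where the value should be 1) and for a logistic distribution standardized to unit variance, for which the change of variable $u = F_1(x)$ gives the closed form $\pi/3$. The function names, integration limits and the clamping of probabilities away from 0 and 1 are assumptions of this sketch.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def are_normal_scores_vs_t(pdf, cdf, a, b, steps=20000):
    """ARE (31.136) for a parent with mean 0 and variance 1, computed as the
    square of the integral of f(x)^2 / phi(J), where J = Phi^{-1}(F(x))."""
    h = (b - a) / steps
    total = 0.0
    for i in range(steps + 1):
        x = a + i * h
        # Keep the probability strictly inside (0, 1) so that inv_cdf is defined.
        prob = min(max(cdf(x), 1e-300), 1.0 - 2.0 ** -53)
        j = STD_NORMAL.inv_cdf(prob)
        weight = 0.5 if i in (0, steps) else 1.0   # trapezoidal end weights
        total += weight * pdf(x) ** 2 / STD_NORMAL.pdf(j)
    return (total * h) ** 2

# Logistic distribution rescaled to unit variance (scale s = sqrt(3) / pi).
S = math.sqrt(3.0) / math.pi

def logistic_pdf(x):
    e = math.exp(-x / S)
    return e / (S * (1.0 + e) ** 2)

def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x / S))

are_normal = are_normal_scores_vs_t(STD_NORMAL.pdf, STD_NORMAL.cdf, -8.0, 8.0)
are_logistic = are_normal_scores_vs_t(logistic_pdf, logistic_cdf, -12.0, 12.0)
print(are_normal, are_logistic, math.pi / 3.0)
```

The logistic value exceeds 1 only slightly, in keeping with the statement that 1 is a minimum attained at the normal distribution.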


31.68 The implication of the result of 31.63-67 is far-reaching. The $c_1$-test is
distribution-free (completely robust) and has minimum ARE of 1 compared to the
standard normal-theory test based on “Student’s” t, which is only fairly robust. It is
therefore difficult to make a case for the customary routine use of “Student’s” test in
testing for a location-shift between two samples when sample numbers are reasonably
large. The t-test has no appreciable advantage (even in the normal case for which it
is optimum) for sample sizes of 4 or 5 with $\alpha$ near 0.05 (cf. Klotz (1964)).

The labour of computing the $c_1$-test is very light. It consists of referring
\[ \frac{c_1 - E(c_1\mid H_0)}{\{\operatorname{var}(c_1\mid H_0)\}^{1/2}} = c_1\left\{\frac{n(n-1)\,n_1}{n_2\sum_{s=1}^{n}\{E(s,n)\}^2}\right\}^{1/2} \]

to a table of the standardized normal distribution. Fisher and Yates’ Tables give
$\sum_{s=1}^{n}\{E(s,n)\}^2$ for n = 1 (1) 50 to 4 d.p. and the individual $E(s,n)$ to 2 d.p. Harter
(1961a) gives all the $E(s,n)$ to 5 d.p. for n = 2 (1) 100 (25) 250 (50) 400.

For n = 50, $\frac{1}{n}\sum_{s=1}^{n}\{E(s,n)\}^2 = 0.97$, and thereafter tends to 1 as we saw below
(31.133), so that this factor may be dropped, reducing the standardized test statistic to

\[ \left(\frac{n-1}{n_1 n_2}\right)^{1/2}\sum_{i=1}^{n_1} E(X_i, n). \]
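
Where tables of $E(s,n)$ are not to hand, the scores can be obtained by direct numerical integration of the density of the sth order statistic, and the standardized statistic then follows from (31.124) and (31.133). The following Python sketch shows one way of doing so; the function names, the integration range and step count are assumptions made here, and the routine assumes that there are no tied observations.

```python
import math
from statistics import NormalDist

def expected_normal_order_statistics(n, lo=-9.0, hi=9.0, steps=6000):
    """E(s, n) for s = 1..n: the mean of the s-th smallest of n standard normal
    observations, integrating x * s*C(n, s) * Phi^(s-1) * (1 - Phi)^(n-s) * phi.
    The integrand is negligible at the end points, so a plain Riemann sum on
    the grid is numerically equivalent to the trapezoidal rule here."""
    nd = NormalDist()
    h = (hi - lo) / steps
    xs = [lo + i * h for i in range(steps + 1)]
    pdf = [nd.pdf(x) for x in xs]
    cdf = [nd.cdf(x) for x in xs]
    sf = [nd.cdf(-x) for x in xs]          # upper tail 1 - Phi(x), without cancellation
    scores = []
    for s in range(1, n + 1):
        const = s * math.comb(n, s)        # n! / ((s-1)! (n-s)!)
        total = sum(x * p ** (s - 1) * q ** (n - s) * d
                    for x, p, q, d in zip(xs, cdf, sf, pdf))
        scores.append(const * total * h)
    return scores

def normal_scores_z(sample1, sample2):
    """Standardized c1: c1 / sqrt(var(c1 | H0)), using (31.124) and (31.133)."""
    n1, n2 = len(sample1), len(sample2)
    n = n1 + n2
    scores = expected_normal_order_statistics(n)
    pooled = sorted([(x, 1) for x in sample1] + [(x, 2) for x in sample2])
    c1 = sum(scores[pos] for pos, (_, label) in enumerate(pooled) if label == 1) / n1
    var0 = n2 / (n * (n - 1) * n1) * sum(e * e for e in scores)
    return c1 / math.sqrt(var0)

scores2 = expected_normal_order_statistics(2)
scores4 = expected_normal_order_statistics(4)
scores50 = expected_normal_order_statistics(50)
mean_square_50 = sum(e * e for e in scores50) / 50
print(scores4, mean_square_50, normal_scores_z([1.0, 2.0], [3.0, 4.0]))
```

The value of `normal_scores_z` is referred to the standardized normal distribution; a first sample lying wholly below the second, as in the usage line, gives a negative value.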


Hodges and Lehmann (1963) show that robust estimators of $\theta$ may be obtained from
the Wilcoxon and the normal-scores tests which have the same efficiency properties as
the tests have ARE (cf. also Hoyland (1965)). Ramachandramurty (1966) gives forms
of these estimators which are robust to scale-shift, and therefore applicable to the general
problem of two means.

Pratt (1964) studies the effect of an unknown scale difference on the size of the
Wilcoxon, normal-scores, “Student’s” t and other location tests. Asymptotically, the
t-test is appreciably more robust only if $n_1/n_2$ is very near 1, and even then the gain is
small if the scale multiple does not exceed 2.


31.69 Other tests for the two-sample problem, now rather overshadowed by the
Wilcoxon and the $c_1$ tests, have been proposed. That of Wald and Wolfowitz (1940)
(cf. Exercise 30.8), based on the number of runs, has the advantage of being consistent
against the general alternatives (31.111) if $\lim n_1/n_2$ is bounded away from 0 and $\infty$.
Smirnov (1939b) proposed a test based on the maximum absolute difference d between
the two sample distribution functions, and showed that $d\left(\frac{1}{n_1}+\frac{1}{n_2}\right)^{-1/2}$ has the same
limiting distribution as that of $D_n n^{1/2}$ given at (30.132). The convergence of the sample d.f.’s
to the parent d.f.’s ensures that the test is consistent against (31.111). A lower bound
for its power may be obtained just as for $D_n$ in 30.60, but the test may be biassed (cf.
Exercise 30.16 for the $D_n$ test).

Lehmann (1951) proposed a test which he showed to be always unbiassed. 

Although rather little is known of the power of these tests, it is clear that they are less 
efficient against normal shift alternatives. In fact, Mood (1954) shows the Wald- 
Wolfowitz test to have ARE of 0 against normal location-shift or normal scale-shift 
alternatives. 

The scale-shift alternative hypothesis

\[ F_1(x) = F_2\!\left(\frac{x}{\theta}\right), \qquad \theta > 0, \tag{31.144} \]

in which we test $H_0: \theta = 1$, is equivalent to a location-shift alternative for the logarithms
of the variables if they are non-negative. Generally, however, we must seek new tests
against (31.144). Mood (1954) proposed $W = \sum_{i=1}^{n_1}\{X_i - \tfrac{1}{2}(n+1)\}^2$, which is
distribution-free, and showed that in the normal case its ARE compared to the optimum variance-ratio
test is $15/(2\pi^2) \doteq 0.76$. For other parents, its ARE ranges from 0 to $\infty$. If (31.144) is
replaced by the more general

\[ F_1(x-\mu) = F_2\{(x-\nu)/\theta\}, \tag{31.145} \]

where the unknown $\mu$, $\nu$ are (nuisance) location parameters, Mood’s test remains
asymptotically distribution-free if $\mu$ and $\nu$ are first estimated by the sample medians (cf. Crouse,
1964).

We may obtain ARE of 1 against normal alternatives by applying the variance-ratio
test to the normal scores $E(X_i, n)$ (cf. Capon (1961), who used the asymptotically
equivalent statistic given by (31.123) with $E(X_i, n)$ replaced by its square). Klotz (1962) used
the square of the inverse normal d.f. transformation, which is also asymptotically
equivalent (cf. Van der Waerden’s test in 31.64) and tabulates critical values of this statistic
for $8 \le n_1+n_2 \le 20$. Raghavachari (1965a, b) confirms that the ARE of Klotz’s test
ranges from 0 to $\infty$ as a function of F in (31.144), and shows that its asymptotic properties
against (31.144) persist against (31.145) when $\mu$, $\nu$ are suitably consistently estimated from
the samples, provided that F is symmetric and sufficiently regular.

Siegel and Tukey (1960) proposed a scale-shift test which has the same $H_0$ distribution
as the Wilcoxon test (because it uses the set of sample rank values; these are allocated to
the observations according to their distance from the extremes of the ordering) but an
ARE of only $6/\pi^2 \doteq 0.61$ in the normal case, as Klotz (1962) showed.

Van Eeden (1964) gives conditions for these and other scale-shift tests to be
consistent. Tamura (1963) showed that against normal scale-shift the ARE of Mood’s
test may be raised to 0.92 by using 5th powers instead of squares in its definition, and
that for a test equivalent to the Siegel-Tukey test, the ARE may be raised to 0.96 by
using tiny fractional powers of the ranks instead of the ranks themselves.

Moses (1963) discusses scale-shift tests generally. 


k-sample tests 


31.70 The generalization of two-sample tests to the k-sample problem (Problem
(2) of 31.13) is straightforward. The hypothesis to be tested is that k > 2 continuous
distribution functions are identical, i.e.

\[ H_0: F_1(x) = F_2(x) = \ldots = F_k(x), \quad \text{all } x, \tag{31.146} \]

on the basis of independent samples of $n_p$ observations (p = 1, 2, ..., k) where
$\sum_{p=1}^{k} n_p = n$. In the parametric theory, all of the $F_p$ are assumed normal with common
unknown variance $\sigma^2$ and different means $\theta_p$, so that

\[ H_1: F_p(x) = F(x-\theta_p), \quad p = 1, 2, \ldots, k, \tag{31.147} \]
with not all the $\theta_p$ equal. (31.146) then becomes
\[ H_0: \theta_1 = \theta_2 = \ldots = \theta_k. \tag{31.148} \]

The LR test of (31.148) in the normal case is based on the statistic

\[ F = \frac{\dfrac{1}{k-1}\sum_{p=1}^{k} n_p(\bar{x}_p-\bar{x})^2}{\dfrac{1}{n-k}\sum_{p=1}^{k}\sum_{i=1}^{n_p}(x_{pi}-\bar{x}_p)^2} \]
(where $\bar{x}_p$ is the pth sample mean and $\bar{x}$ the overall sample mean), which is distributed
in the variance-ratio (F) distribution with (k−1, n−k) degrees of freedom. This
follows immediately from the general LR theory of Chapter 24, and is the simplest
case of the Analysis of Variance, which we shall discuss in Volume 3. As $(n-k) \to \infty$,
it follows (cf. 16.22(7)) that

\[ S = \sum_{p=1}^{k} n_p(\bar{x}_p-\bar{x})^2/\sigma^2 \tag{31.149} \]

is asymptotically a $\chi^2$ variate with k−1 degrees of freedom.

This test has been shown by Gayen (1949-1951) to be remarkably robust to depar- 
tures from normality, and we shall be discussing the robustness of Analysis of Variance 
procedures in general in Volume 3. Here we consider only distribution-free substitutes 
for the normal theory test. 


31.71 Since (31.147) is clearly a generalization of the location-shift alternative
(31.76) in the two-sample case, it is natural to seek generalizations of the two-sample
tests to k samples. We consider two different approaches to the problem. First,
suppose that we simply replace the observations x by their ranks X. The statistic
(31.149) then becomes

\[ S = \sum_{p=1}^{k} n_p\{\bar{X}_p - \tfrac{1}{2}(n+1)\}^2\Big/\left\{\tfrac{1}{12}(n^2-1)\right\}, \tag{31.150} \]

reducing to the Wilcoxon test when k = 2. Kruskal and Wallis (1952-1953) proposed
the statistic $H = (n-1)S/n$, large values of H forming the critical region of the test.
They demonstrated its asymptotic $\chi^2_{k-1}$ distribution as in the parametric case. For
k = 3, $n_p \le 5$, they tabulate its exact distribution in the neighbourhood of its critical
values for $\alpha$ = 0.10, 0.05, 0.01. Kruskal (1952) showed that the H-test is consistent
against any alternative hypothesis for which an observation from one of the parent
distributions has probability $\ne \tfrac{1}{2}$ of exceeding a random observation from the k parents
taken together. This is a generalization of the consistency condition for the Wilcoxon
test in 31.56.
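
The computation of H from (31.150) is easily mechanized. The following Python sketch ranks the pooled observations, forms S and returns $H = (n-1)S/n$ in exact rational arithmetic. It assumes that no two observations are tied; the function name and the illustrative data are inventions of this sketch.

```python
from fractions import Fraction

def kruskal_wallis_h(samples):
    """H = (n - 1) S / n, with S as in (31.150).

    samples is a list of k lists of observations; ties are assumed absent.
    """
    pooled = sorted((x, p) for p, sample in enumerate(samples) for x in sample)
    n = len(pooled)
    rank_sums = [0] * len(samples)
    for rank, (_, p) in enumerate(pooled, start=1):
        rank_sums[p] += rank
    centre = Fraction(n + 1, 2)                      # mean of the ranks 1..n
    numerator = sum(len(sample) * (Fraction(r, len(sample)) - centre) ** 2
                    for sample, r in zip(samples, rank_sums))
    s_stat = numerator / Fraction(n * n - 1, 12)     # divide by (n^2 - 1) / 12
    return Fraction(n - 1, n) * s_stat

# Hypothetical data: three completely separated samples of size 3.
data = [[1.1, 2.4, 3.0], [3.5, 4.2, 5.9], [6.1, 7.3, 8.8]]
h_value = kruskal_wallis_h(data)
print(h_value)
```

For these data the rank sums are 6, 15 and 24, so the result can be checked by hand against the equivalent form $\frac{12}{n(n+1)}\sum R_p^2/n_p - 3(n+1)$, where $R_p$ is the pth rank sum.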


Puri (1964) considers the statistic obtained by replacing the observations by any
set of conventional numbers, gives conditions for its asymptotic normality and obtains
its ARE, which is independent of k. So far as the H-test is concerned, this means
that its ARE is given by the Wilcoxon expression (31.115) (cf. Andrews (1954)) and the
analysis of 31.60-1 is applicable without change to the H-test. If normal scores are used
instead of ranks, we obtain the generalization of the $c_1$ test of (31.124), the ARE is given
by (31.136), and by 31.66-7 this is at least 1 as before.


31.72 In the two-sample case, we saw in 31.52 that it made no difference to the
test statistic derived whether we replaced observations by ranks or counted the number
of inversions of order between the samples. In the present k-sample case, it does
matter. We now proceed to the inversions approach, and shall find that it leads to a
different statistic from H of 31.71.

Suppose that the statistic U of (31.98) is calculated for every pair of samples, there
being $\tfrac{1}{2}k(k-1)$ pairs in all; we write $U_{pq}$ for the value obtained from the pth and qth
samples (p, q = 1, 2, ..., k; p ≠ q) and

\[ U = \sum_{p=1}^{k-1}\sum_{q=p+1}^{k} U_{pq}. \tag{31.151} \]

We may now very easily generalize the theory of 31.53. (31.100) is replaced by
$Q_n = \sum_{p=1}^{k} Q_{n_p} + U$, which leads to the c.f. relationship

\[ \phi_U(\theta) = \phi_n(\theta)\Big/\prod_{p=1}^{k}\phi_{n_p}(\theta), \tag{31.152} \]

which is the analogue of (31.101). We find, corresponding to (31.103), for the c.g.f.
of U

\[ \psi_U(\theta) = \tfrac{1}{4}i\theta\left(n^2 - \sum_{p=1}^{k} n_p^2\right) + \sum_{j=1}^{\infty}\frac{B_{2j}(i\theta)^{2j}}{2j\,(2j)!}\left\{\sum_{s=1}^{n} s^{2j} - \sum_{p=1}^{k}\sum_{s=1}^{n_p} s^{2j}\right\}. \tag{31.153} \]

The cumulants of U are therefore

\[ \begin{aligned}
\kappa_1 &= \tfrac{1}{4}\left(n^2 - \sum_{p=1}^{k} n_p^2\right), \qquad \kappa_{2j+1} = 0, \quad j \ge 1; \\
\kappa_{2j} &= \frac{B_{2j}}{2j}\left\{\sum_{s=1}^{n} s^{2j} - \sum_{p=1}^{k}\sum_{s=1}^{n_p} s^{2j}\right\}, \quad j \ge 1.
\end{aligned} \tag{31.154} \]
In particular
\[ \kappa_2 = \frac{1}{72}\left\{n^2(2n+3) - \sum_{p=1}^{k} n_p^2(2n_p+3)\right\}. \tag{31.155} \]
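
The mean and variance of U, namely $\kappa_1 = \tfrac14(n^2 - \sum n_p^2)$ and $\kappa_2 = \tfrac{1}{72}\{n^2(2n+3) - \sum n_p^2(2n_p+3)\}$, can be verified for small sample sizes by enumerating the permutation distribution directly. The sketch below does so in exact arithmetic. The function names are assumptions of this sketch, as is the counting convention adopted for $U_{pq}$ (the number of pairs in which an observation from the lower-numbered sample exceeds one from the higher-numbered sample); under $H_0$ the opposite convention gives the same mean and variance.

```python
from fractions import Fraction
from itertools import permutations

def terpstra_u(labels):
    """U of (31.151) for observations arranged in increasing order, labels[r]
    being the sample index (0..k-1) of the r-th smallest observation."""
    return sum(1 for i in range(len(labels)) for j in range(i)
               if labels[i] < labels[j])

def null_moments_by_enumeration(sizes):
    """Mean and variance of U over all equally likely orderings under H0."""
    labels = [p for p, m in enumerate(sizes) for _ in range(m)]
    # Each distinct arrangement of labels occurs equally often among the n!
    # permutations, so averaging over all of them gives the null moments.
    values = [terpstra_u(perm) for perm in permutations(labels)]
    mean = Fraction(sum(values), len(values))
    var = Fraction(sum(v * v for v in values), len(values)) - mean ** 2
    return mean, var

def null_moments_by_formula(sizes):
    """kappa_1 and kappa_2 as given in the text."""
    n = sum(sizes)
    mean = Fraction(n * n - sum(m * m for m in sizes), 4)
    var = Fraction(n * n * (2 * n + 3) - sum(m * m * (2 * m + 3) for m in sizes), 72)
    return mean, var

print(null_moments_by_enumeration((2, 2, 1)), null_moments_by_formula((2, 2, 1)))
```

With k = 2 the variance formula reduces to $n_1 n_2(n+1)/12$, the Wilcoxon value.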


31.73 The limit distribution of U also follows as in 31.55. If the $n_p \to \infty$ so
that $n_p/n$ remains bounded for all p, we write N for any $n_p$ or n and see that $\kappa_{2j}$ is at
most of order (2j+1) in N, with $\kappa_2$ of order $N^3$. Thus $\kappa_{2j}/\kappa_2^{\,j}$ is of order $N^{1-j}$ at most
and tends to zero for all j > 1, so that U tends to normality with mean and variance
given by (31.154-155). Jonckheere (1954) shows that if only two of the k sample sizes
tend to infinity so that $n_p/n$, $n_q/n$ remain bounded, the distribution of U still tends to
normality; this may be seen from the consideration that U is the sum of $\tfrac{1}{2}k(k-1)$
(non-independent) Wilcoxon statistics $U_{pq}$. If r sample sizes, $r \ge 2$, tend to infinity,
$\tfrac{1}{2}r(r-1)$ of the $U_{pq}$ will tend to normality and will dominate U.

Jonckheere (1954) tabulated the exact distribution of U for samples of equal sizes m.
His table covers k = 3, m = 2 (1) 5; k = 4, m = 2 (1) 4; k = 5, m = 2, 3; and
k = 6, m = 2. Beyond this point, the normal approximation is adequate for equal
sample sizes. Even for very unequal sample sizes, the normal approximation seems
adequate for practical purposes when n > 12.

Terpstra (1952), who originally proposed the k-sample U-test and derived the c.f.
and limiting distribution given above, gave necessary and sufficient conditions for the
consistency of the test. If the probability that an observation from the pth distribution
exceeds one from the qth is (cf. (31.51))

\[ P\{x_{pi} > x_{qj}\} = \tfrac{1}{2} + \varepsilon_{pq}, \qquad p \ne q, \]
and the weighted sum of the $\varepsilon_{pq}$ is
\[ S_n = \sum_{p<q} n_p n_q\,\varepsilon_{pq}, \]

then the conditions for consistency (as $n \to \infty$ with $n_p/n$ bounded for all p) of the test
using large values of U as critical region are (1) $S_n > 0$, (2) $\kappa_2^{-1/2} S_n \to \infty$. These
are direct generalizations of (31.58), the condition for consistency of Q in testing
randomness against downward trend.


31.74 So far as we know, the efficiencies of the two k-sample test statistics H and
U have not been compared. The difficulty is that the forms of their limiting
distributions are different; for fixed k, H has a non-central $\chi^2_{k-1}$ distribution, and U a normal
distribution asymptotically when $H_0$ holds and presumably also under general
alternatives. It seems likely that the U-test will be at its best when the alternatives are
of the form (31.147) with $\theta_1 < \theta_2 < \ldots < \theta_k$, or in the more general situation when
(31.147) is replaced by
\[ F_1(x) < F_2(x) < \ldots < F_k(x), \quad \text{all } x. \tag{31.156} \]
(31.156) may be referred to as an ordered alternative hypothesis. Bartholomew (1961)
shows that U is asymptotically very efficient when the $\theta_p$ are equally-spaced and the
$n_p$ are all equal. The H-test, on the other hand, is likely to be more efficient against
broader, more general, classes of alternatives.


Hogg (1962) gives a method of constructing distribution-free k-sample tests from two- 
sample tests. 


Tests of symmetry 

31.75 In all of the hypotheses discussed in this chapter, we have been
fundamentally concerned with n independent observations (usually on a single variate x but, in
the case of testing independence, on a vector (x, y)). Our hypotheses have specified
that certain of these observations are identically distributed, and proceeded to test
some hypotheses concerning their parent distribution functions. We found (cf. 31.16)
that, to obtain similar tests of our hypotheses, we must restrict ourselves to
permutation tests, the distribution theory of which assigns equal probability to each of the n!
orderings of the sample observations.

An implication of this procedure is that the tests we have derived remain valid if
the hypotheses we have considered are replaced by the direct hypothesis that the joint
distribution function of the observations is invariant under permutation of its
arguments. For example, consider a two-sample test of the hypothesis

\[ H_0: F_1(x) = F_2(x), \quad \text{all } x, \tag{31.157} \]
where $n_1$, $n_2$ are the respective sizes of random samples from the two distributions
and $n = n_1+n_2$. Write G for the joint distribution function of the n observations.
Replace $H_0$ by the hypothesis of symmetry

\[ H_0: G(x_1, x_2, \ldots, x_n) = G(z_1, z_2, \ldots, z_n), \tag{31.158} \]

where the z’s are any permutation of the x’s. Then any similar test which is valid
for (31.157) will be so for (31.158) also. This is not to say that the optimum
properties of a test will remain the same for both hypotheses; a discussion of this point is
given by Lehmann and Stein (1949). However, it does imply that any test of (31.157)
cannot be consistent against the alternative hypothesis (31.158).

Practical situations are common in which a hypothesis of symmetry is appropriate. 
Since we have not discussed this problem so far even in the parametric case, we shall 
begin by a brief consideration of the latter in the simplest case. 


The paired t-test 


31.76 Suppose that variates $x_1$ and $x_2$ are jointly normally distributed with means
and variances $(\mu_1, \sigma_1^2)$, $(\mu_2, \sigma_2^2)$ respectively and correlation parameter $\rho$. We wish to
test the composite hypothesis

\[ H_0: \Delta = \mu_1-\mu_2 = 0 \tag{31.159} \]
on the basis of n independent observations on $(x_1, x_2)$. Consider the variable
$y = x_1-x_2$. It is normally distributed, with mean $\Delta$ and variance $\sigma^2 = \sigma_1^2+\sigma_2^2-2\rho\sigma_1\sigma_2$.
We have n observations on y available and may therefore test $H_0$ by the usual
“Student’s” t-test for the mean applied to the differences $(x_{1i}-x_{2i})$, i = 1, 2, ..., n.
The procedure holds good when $\rho = 0$, when $x_1$ and $x_2$ are independent normal
variates, and in this particular situation the test is a special case of that given at (21.51)
with $n_1 = n_2 = n$.


31.77 Next simplify the example in 31.76 by putting $\sigma_1^2 = \sigma_2^2$. The joint
distribution $F(x_1, x_2)$ is now symmetric in $x_1$ and $x_2$ save possibly for their means. When
$H_0$ holds, we have complete symmetry. We may therefore write (31.159) as

\[ H_0: F(x_1, x_2) = F(x_2, x_1), \quad \text{all } x_1, x_2. \tag{31.160} \]

This is a typical symmetry hypothesis, which may formally be put into the general form
(31.158) by writing G as the product of n factors (one for each observation on $(x_1, x_2)$).

We now abandon the normal theory of 31.76 and seek distribution-free methods
of testing the non-parametric hypothesis (31.160) for arbitrary continuous F. If we
take differences $y = x_1-x_2$ as before, we see that $H_0$ implies the symmetry of the
distribution of y about the point 0 or, if G is its d.f.,

\[ H_0: G(y) = 1-G(-y), \quad \text{all } y. \tag{31.161} \]


We have thus reduced the hypothesis (31.160) of symmetry of a bivariate d.f. in its
arguments to the hypothesis (31.161) of symmetry of a univariate distribution about
a particular value, zero. This hypothesis is clearly of interest in its own right (i.e.
we may simply be interested in the symmetry of a single variate), and we proceed to
treat the problem in this form.

31.78 The hypothesis (31.161) implies that any observed absolute value of y, $|y_i|$,
has equal probability of arising from a positive and a negative value of $y_i$. Thus there
are $2^n$ equiprobable samples generated by associating every observed $|y_i|$ with a
positive and a negative sign alternately, and combining all the $|y_i|$ in every possible way.
We have formed the basis for a permutation test by permuting the sign attached to
each $|y_i|$, and now have to select the test statistic. If we consider the alternative
that the observations are normal with mean $\theta \ne 0$, we find as in 31.18 and 31.44 that the
most powerful permutation test is based on the sum of the observed $y_i$ (with their
true signs, of course). We defer consideration of this test (suggested by Fisher (1935a))
since it is a special case of the Analysis of Variance, which we shall discuss in Chapter
37, Volume 3.


31.79 Fisher’s test of symmetry is equivalent to using the sum of the $|y_i|$ over
observed positive values (since the sum of the $|y_i|$ is the same for each of the $2^n$
permutations). For a rank test of symmetry, we may replace observations by ranks and
obtain the Wilcoxon test of symmetry, based on the sum of the ranks of the $|y_i|$ over
observed positive values; or we may use the equivalent of the $c_1$-test, based on the
expected values of the order-statistics from the positive half of a normal distribution.
These three symmetry tests have ARE of 1, $3/\pi$ and 1, respectively, compared with
the t-test against normal location-shift alternatives, just as for the two-sample location
tests. Small-sample efficiency is high against normal alternatives (Klotz (1963)) for
both rank tests; Klotz (1965) also considers an alternative measure of efficiency.
The Wilcoxon symmetry test does not retain its high power under some non-normal
(long tail) alternatives, but is still better than “Student’s” t, although worse than the
Sign test (H. J. Arnold (1965)) to be discussed in 32.6.

McCornack (1965) gives extensive tables of critical values for the Wilcoxon sym- 
metry test. 
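
For small n the exact permutation distribution of the Wilcoxon symmetry statistic can be obtained by enumerating the $2^n$ equiprobable sign allocations of 31.78. The Python sketch below computes the sum of the ranks of the $|y_i|$ over positive $y_i$ and its exact upper-tail probability under (31.161). The function name and the data are inventions of this sketch, which assumes no zero differences and no ties among the $|y_i|$.

```python
from fractions import Fraction
from itertools import product

def wilcoxon_signed_rank_upper_p(y):
    """Return (T, P{T* >= T}) where T is the sum of the ranks of |y_i| over the
    positive y_i and T* is its value under a random allocation of signs."""
    order = sorted(range(len(y)), key=lambda i: abs(y[i]))
    ranks = [0] * len(y)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    observed = sum(r for r, v in zip(ranks, y) if v > 0)
    count = 0
    # Each of the 2^n patterns of signs is equally probable under H0.
    for signs in product((0, 1), repeat=len(y)):
        if sum(r for r, s in zip(ranks, signs) if s) >= observed:
            count += 1
    return observed, Fraction(count, 2 ** len(y))

# Hypothetical differences y = x1 - x2 for five pairs.
t_plus, p_upper = wilcoxon_signed_rank_upper_p([1.2, -0.5, 2.3, 0.9, 3.1])
print(t_plus, p_upper)
```

Here only the smallest absolute difference is negative, so only two of the 32 sign patterns give a rank sum at least as large as that observed.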


The effects of discontinuities: continuity corrections and ties 

31.80 In various places, in this chapter as elsewhere, we have approximated
discontinuous distributions (in the present context the permutation distributions of test
statistics) by their asymptotic forms, which are continuous. The approximation is
often, but not always (cf. Plackett (1964) and 33.27 below), improved if we apply a
continuity correction, which amounts to the following simple rule: when successive
discrete probabilities in the exact distribution occur at values $z_1$, $z_2$, $z_3$, the probability at
$z_2$ is taken to refer to the interval $(\tfrac{1}{2}(z_1+z_2), \tfrac{1}{2}(z_2+z_3))$. Thus, when we wish to
evaluate the d.f. at the point $z_2$ from a continuous approximation, we actually evaluate it at
the point $\tfrac{1}{2}(z_2+z_3)$.
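
The effect of the rule may be seen in a small case. The sketch below compares the exact null d.f. of the two-sample Wilcoxon statistic U for $n_1 = n_2 = 4$ with its normal approximation (mean $\tfrac12 n_1 n_2$, variance $n_1 n_2(n+1)/12$), evaluated with and without the continuity correction. Since successive values of U differ by 1, the corrected d.f. at u is evaluated at $u + \tfrac12$. The function names are assumptions of this sketch, and U is taken as the rank sum of the first sample less $\tfrac12 n_1(n_1+1)$.

```python
from fractions import Fraction
from itertools import combinations
from statistics import NormalDist

def exact_u_cdf(n1, n2, u):
    """Exact P{U <= u} under H0, all C(n, n1) allocations of ranks to the
    first sample being equally likely."""
    n = n1 + n2
    hits = total = 0
    for ranks in combinations(range(1, n + 1), n1):
        stat = sum(ranks) - n1 * (n1 + 1) // 2
        total += 1
        hits += stat <= u
    return Fraction(hits, total)

def normal_u_cdf(n1, n2, u, corrected):
    """Normal approximation to P{U <= u}, optionally continuity-corrected."""
    mean = n1 * n2 / 2
    sd = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    point = u + 0.5 if corrected else u
    return NormalDist(mean, sd).cdf(point)

exact = exact_u_cdf(4, 4, 2)
plain = normal_u_cdf(4, 4, 2, corrected=False)
adjusted = normal_u_cdf(4, 4, 2, corrected=True)
print(float(exact), plain, adjusted)
```

In this instance the corrected value lies much closer to the exact probability than the uncorrected one, although, as noted above, the correction does not always help.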


31.81 There is another question connected with continuity which we should
discuss here. Our hypotheses have been concerned with observations from
continuous d.f.’s, and this implies that the probability of any pair of observations being
precisely equal (a so-called tie) is zero and that we may therefore neglect the possibility.
Thus we have throughout this chapter assumed that observations could be ordered
without ties, so that the rank-order statistics were uniquely defined. However, in
practice, observations are always rounded off to a few significant figures, and ties will
therefore sometimes occur. Similarly, if the true parent d.f.’s are not in fact
continuous, but are adequately represented by continuous d.f.’s, ties will occur. How are
we to resolve the difficulty of obtaining a ranking in the presence of ties?

Two methods of treating ties have been discussed in the literature. The first is
to order tied observations at random. ‘This has the merit of simplicity and needs no 
new theory, but obviously sacrifices information contained in the observations and may 
be expected to lead to loss of efficiency compared with the second method, which is 
to attribute to each of the tied observations the average rank of those tied. ‘There has 
been rather little investigation of the merits of the methods, but Putter (1955) shows 
that the ARE of the Wilcoxon test is less for random tie-breaking than when average 
ranks are allotted. Kruskal and Wallis (1952-1953) and Kruskal (1952) present a dis- 
cussion of ties in the H-test. 

Until further information is available, the average-rank method is likely to be the 
more commonly used. Unfortunately, it removes the feature of rank order tests 
which we have remarked, that their exact distributions can be tabulated once for all. 
For, if the average-rank method of tie-breaking is used, the sum of a set of ranks is 
unaffected but, e.g., their variance is changed. The permutation distribution for small 
sample sizes now becomes a function of the number and extent of the ties observed, 
and this makes tabulation difficult. Kendall (1962) gives full details of and references 
to the necessary adjustments for the rank correlation coefficients and related statistics 
(which include the Wilcoxon test statistic), and other discussions of adjustments have 
been mentioned above. 
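
The remark that average ranks leave the sum of a set of ranks unaffected but change their variance can be checked numerically. The following is a minimal sketch in Python; the function names `average_ranks` and `variance`, and the data, are hypothetical and chosen only for illustration.

```python
def average_ranks(values):
    """Rank from 1 to n, giving tied observations the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # mean of the ranks start+1, ..., end+1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def variance(xs):
    """Variance with divisor n, as for the ranks 1, ..., n."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


data = [3.1, 2.0, 3.1, 5.4, 2.0, 2.0]        # illustrative observations with ties
r = average_ranks(data)
n = len(data)
print(r)                                      # average ranks
print(sum(r), n * (n + 1) / 2)                # the sum of ranks is unchanged by ties
print(variance(r), (n * n - 1) / 12)          # the variance is reduced by ties
```

With ties present the variance falls below the untied value (n² − 1)/12, which is why the permutation distribution comes to depend on the pattern of ties observed.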


31.82 Finally, we have seen in 31.56 and 31.62, and will see again in 32.9 and 
32.13, that if the parent distribution is discrete, methods based on the continuity 
assumption prove to be conservative. 


EXERCISES 


31.1 By use of tables of the $\chi^2$ distribution, verify the values of the true probabilities 
in the table in 31.7. 


31.2 Verify that the distribution (31.18) has moments (31.12), (31.13) and (31.17). 


31.3 If $r$ is the correlation coefficient defined at (31.11), if we transform the observed 
x- and y-values by $X = t_1(x)$, $Y = t_2(y)$, and calculate $R$, the correlation between the 
transformed values $(X, Y)$, then every one of the equiprobable $n!$ permutations yields 
values $r$, $R$. Show that the correlation between $r$ and $R$ over the $n!$ permutations is 
given by 

$$C(r, R) = C_1(x, X)\,C_2(y, Y),$$

i.e. that the correlation coefficient of the joint permutation distribution of the observed 
correlation coefficient and the correlation coefficient of the transformed observations is 
simply the product of the correlation between x and its transform with the correlation 
between y and its transform. (Daniels, 1944) 


31.4 Derive the fourth moment of the rank correlation coefficient given at (31.22) 
from the general expression in (31.14). 

31.5 Show that (31.21) and (31.40) are alternative definitions of (31.19) by proving 
the identities (31.20) and (31.39). 


31.6 Using the definitions (31.37), (31.38), show that in the joint distribution of $t$ and 
$r_s$ over the $n!$ equiprobable permutations in the case of independence, their correlation 
coefficient is $2(n+1)/\{2n(2n+5)\}^{1/2}$. (cf. Daniels (1944)) 


31.7 A sample of $n$ pairs $(x, y)$ is obtained from a continuous bivariate distribution. 
Let $\tilde{x}$, $\tilde{y}$ be the sample medians of $x$ and $y$ and define the statistic 

$$u = \sum_{i=1}^{n} \operatorname{sgn}(x_i - \tilde{x})\,\operatorname{sgn}(y_i - \tilde{y}).$$

Show how $u$ may be used to test the hypothesis of independence of $x$ and $y$, and that its 
ARE compared to the sample correlation coefficient against bivariate normal alternatives 
is $4/\pi^2$. (This is called the medial correlation test.) (Blomqvist, 1950) 


31.8 In 31.36, use the result of Exercise 2.15, and the symmetry of the distribution 
of $Q$, to show that a sufficient condition for a size-$\alpha$ test of randomness to be strictly 
unbiassed against the alternatives (31.51) is that 


Sn > In(n—1)—(1—&) {4n (n—- 1) — Qo}, 


where Q, is the critical value of Q defined at (31.55). 
(cf. Mann (1945)) 


31.9 Show that (31.69) holds for the statistic $V$ defined by (31.37) as well as for $Q$, 
and hence that $r_s$ as a test of randomness has the same ARE as the other rank correlation 
coefficient $t$. 


31.10 In testing the hypothesis (31.49) of randomness against the normal regression 
alternatives (31.61), consider the class of test statistics 

$$S = \sum w_{ij} h_{ij},$$

where the summation contains $\tfrac12 n$ terms ($n$ a multiple of 6), the suffixes $i$, $j$ each taking $\tfrac12 n$ 
different values and all suffixes being distinct, while the $w_{ij}$ are weights. Thus $S$ involves 
$\tfrac12 n$ comparisons between independent pairs of observations. 
Show that the S-statistic with maximum ARE compared with $b$ at (31.63-64) is 

$$S_1 = \sum_{k=1}^{n/2} (n - 2k + 1)\,h_{k,\,n-k+1}$$

with ARE 

$$A_{S_1,b} = (2/\pi)^{1/3} \approx 0.86.$$

(D. R. Cox and Stuart, 1955) 


31.11 In Exercise 31.10, show that if instead of $S_1$ we use the equally-weighted form 

$$S_2 = \sum_{k=1}^{n/2} h_{k,\,n-k+1},$$

the ARE is reduced to $\{3/(2\pi)\}^{1/3} \approx 0.78$, but that the maximum ARE attainable by an 
S-statistic with all weights 1 or 0 is $\{16/(9\pi)\}^{1/3} \approx 0.83$, which is the ARE of 

$$S_3 = \sum_{k=1}^{n/3} h_{k,\,\frac23 n+k}.$$

$S_3$ involves only $\tfrac13 n$ comparisons, between the "earliest" and "latest" observations. 
(D. R. Cox and Stuart, 1955) 

31.12 Define the statistic for testing randomness 

$$B = \sum_{i=1}^{n/2} \operatorname{sgn}(x_i - \tilde{x}),$$

where $\tilde{x}$ is the median of the sample of size $n$ ($n$ even). Show that its ARE against normal 
alternatives is exactly that of S, in Exercise 31.11. 
(D. R. Cox and Stuart (1955); cf. G. W. Brown and Mood (1951)) 


31.13 $N$ samples, each of size $n$, are drawn independently from a continuous distribution 
with mean $\mu$ and variance $\sigma^2$, and the observations ranked from 1 to $n$ in each sample. 
For the $Nn$ combined observations, the correlation coefficient between the variate-values 
and the corresponding ranks is calculated. Show that as $N \to \infty$ this tends to 

$$C_n = \left\{\frac{12(n-1)}{n+1}\right\}^{1/2} \frac{1}{\sigma} \int_{-\infty}^{\infty} x\{F(x) - \tfrac12\}\,dF(x) = \left(\frac{n-1}{n+1}\right)^{1/2} \frac{3^{1/2}\Delta}{2\sigma},$$

where $\Delta$ is Gini's coefficient of mean difference defined by (2.24) and Exercise 2.9. 
In particular, show that for a normal distribution 

$$C_n = \left\{\frac{3(n-1)}{\pi(n+1)}\right\}^{1/2},$$

so that $C = \lim_{n \to \infty} C_n = (3/\pi)^{1/2} \approx 0.98$. (Stuart, 1954c) 


31.14 Use (a) the theorem that the correlation between an efficient estimator and 
another estimator is the square root of the estimating efficiency of the latter (cf. (17.61)), 
(b) the relation between estimating efficiency and ARE given in 25.13, (c) Daniels’ 
theorem of Exercise 31.3, and (d) the last result of Exercise 31.13 to establish the results 
for the ARE of the rank correlation coefficient $r_s$ (and hence also $t$) as a test of independence 
(31.48) and as a test of randomness (31.70) ; and also to establish the ARE of Wilcoxon’s 
rank-sum test against normal alternatives (31.117). 


(Stuart, 1954c) 


31.15 Obtain the variance of Wilcoxon’s test statistic given at (31.105) by considering 
the mean of a sample of $n_1$ integers drawn from the finite population formed by the first 
$n$ natural numbers. 

(Kruskal and Wallis, 1952-1953) 


31.16 In 31.56 show for the two-sample Wilcoxon test statistic $U$ that whatever the 
parent distributions $F_1$, $F_2$, 
$$E(U) = n_1 n_2 p,$$
var U = O(N), 
where $N$ stands indifferently for $n_1$, $n_2$, $n$. Hence, as $n_1, n_2 \to \infty$ with $n_1/n_2$ fixed, show 
that the test is consistent if $p \neq \tfrac12$. 
(Pitman, 1948) 


31.17 Show that for the logistic distribution of Exercise 17.5, the ARE of the Wilcoxon 
test compared to "Student's" t-test for a location shift is $\pi^2/9$. 

(This with the results of Exercise 17.5 and 25.13 implies that the Wilcoxon test is 
asymptotically best. Capon (1961) shows that this is generally so for locally most powerful 
rank tests, which the Wilcoxon is in this case.) 

31.18 For the distribution 

$$dF = \exp(-x)\,x^{p-1}\,dx/\Gamma(p), \qquad 0 \leq x < \infty,\quad p > \tfrac12,$$

show that the ARE of the Wilcoxon test compared to the "Student's" t-test for a shift 
in location between two samples is 

$$A_{W,t} = \frac{3p}{2^{4(p-1)}\{(2p-1)B(p,p)\}^2},$$

a monotone decreasing function of $p$. Verify that $A_{W,t} > 1.25$ for $p \leq 3$. Show that 
as $p \to \tfrac12$, $A_{W,t} \to \infty$, and that as $p \to \infty$, $A_{W,t} \to 3/\pi$, agreeing with (31.117). 


31.19. Show that the H-test of 31.71 reduces when k = 2 to the Wilcoxon test with 
critical region equally shared between the tails of the test statistic. 


31.20 Using the result of 25.15 concerning the ARE of two test statistics with limiting 
non-central $\chi^2$ distributions with equal degrees of freedom and only the non-central 
parameter a function of the distance from $H_0$, establish that the k-sample H-test of 31.71 
has ARE, compared to the standard F-test in the normal case, equal to (31.115). 

(cf. Andrews, 1954) 


31.21 Show that in testing $\rho = 0$ for a bivariate normal population, the sample 
correlation coefficient r gives UMPU tests against one- and two-sided alternatives. 


31.22 Show that Wilcoxon’s test has ARE of 1 compared with “‘ Student’s ”’ t-test 


against a location-shift for rectangular alternatives. 


(Pitman, 1948) 


31.23 Using (31.115), (31.136) and (31.138), show that the ARE of the Wilcoxon 
test compared to the c, test for location-shift alternatives is 


: Jf Ove 
Av, ¢ = 12 [ {fi (x) Pade / i JF, (I ; 


6 
IU 


and hence that 


(Hodges and Lehmann (1961) show that 
both equalities can be attained.) 


31.24 Show that the intervals (31.95) based on the two-sample Wilcoxon statistic U 
at (31.98) are {D(U,), D(myn,+1—U,)}, where Prob {U< Uy | Ho} = 4a and D(r) is 
the rth smallest of the m,n, differences (x i— 2). (Lehmann, 1963) 


31.25 Show that the Wilcoxon test of symmetry, discussed in 31.79, yields distribu- 
tion-free confidence intervals for the location parameter of a symmetric distribution 
{A(W,,), A(an(n—1)+1—W.)} where W,, is the lower critical value of the “‘equal-tails”’ 
test and A(r) is the rth smallest of the 4n(n—1) averages 3(xi +4), ing Le. i. 

(Lehmann, 1963) 


CHAPTER 32 


SOME USES OF ORDER-STATISTICS 


32.1 In Chapter 31 we found that simple and remarkably efficient permutation 
tests of certain non-parametric hypotheses are obtained by the use of ranks, reflecting 
the order-relationships among the observations. In this chapter we first discuss the 
uses to which the order-statistics themselves can be put in providing distribution-free 
procedures for the non-parametric problems (5) and (6) listed in 31.13. We then go 
on to consider uses of order-statistics in other (parametric) situations. The reader 
is reminded that the general distribution theory of order-statistics was discussed in 
Chapter 14, and that the theory of minimum-variance unbiassed estimation of location 
and scale parameters by linear functions of the order statistics was given in Chapter 19. 
A valuable general review of the literature of order-statistics was given by Wilks (1948), 
whose extensive bibliography is supplemented by the later one of F. N. David and 
Johnson (1956). Expositions of many branches of the theory, with extensive tables, 
are given in Sarhan and Greenberg (1962). A review of the theory of ‘‘ spacings ” 
(differences between successive order-statistics) is given by Pyke (1965). 


Sign test for quantiles 


32.2 The so-called Sign test for the value of a quantile of a continuous distribution 
seems to have been the first distribution-free test ever used,(*) but the modern 
interest in it dates from the work of Cochran (1937). 

Suppose that the parent d.f. is $F(x)$ and that 

$$F(X_p) = p \tag{32.1}$$

so that $X_p$ is the p-quantile of the distribution, i.e. the value below which 100p per 
cent of the distribution lies. For any $p$, $0 < p < 1$, the value $X_p$ is a location value of 
the distribution. We wish to test the hypothesis 

$$H_0 : X_p = x_0, \tag{32.2}$$

where $x_0$ is some specified value. (If we take $x_0$ as our origin of measurement for 
convenience, we wish to test whether $X_p$ is zero.) 


32.3 If we have a sample of $n$ observations, we know that the sample distribution 
function will converge in probability to the parent d.f. Let us, then, observe the relationship 
between the order-statistics $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ and the hypothetical value of 
$X_p$ to be tested. We simply count how many of the sample observations fall below 


(*) Todhunter (1865) refers to its use in simple form by John Arbuthnot (Physician to Queen 
Anne, and formerly a mathematics teacher) to support An Argument for Divine Providence taken 
from the constant Regularity observ’d in the Births of both Sexes (1710-1712) ; Arbuthnot was a 
well-known wit and the author of the satire The Art of Political Lying. 
$x_0$, i.e. the statistic 

$$S = \sum_{i=1}^{n} h(x_0 - x_{(i)}) = \sum_{i=1}^{n} h(x_0 - x_i) \tag{32.3}$$

where (cf. (31.36)) 

$$h(z) = \begin{cases} 1, & z > 0, \\ 0, & z < 0. \end{cases}$$

$S$ counts the number of positive signs among the differences $(x_0 - x_i)$, and hence the 
test based on $S$ is called the Sign test.(*) The distribution of $S$ is at once seen to be 
binomial, for $S$ is the sum of $n$ independent observations on a 0-1 variable $h(x_0 - x)$ 
with 

$$P\{h(x_0 - x) = 1\} = P\{x < x_0\} = P,$$

say. The hypothesis (32.2) reduces to 

$$H_0 : P = p, \tag{32.4}$$

and we are simply testing the value of the binomial parameter $P$. We may wish to 
consider either one- or two-sided alternatives to (32.4). 

If we specify nothing further about the parent d.f. $F(x)$, it is obvious intuitively 
that we cannot improve on $S$ as a test statistic, and we find from binomial theory (cf. 
Exercises 22.2 and 23.31) that for the one-sided alternative $H_1 : P > p$, the critical region 
consisting of large values of $S$ is UMP, while for the two-sided alternative $H_2 : P \neq p$, 
a two-tailed critical region is UMPU. 

In the most important case in practice, when $p = \tfrac12$ and we are testing the median 
of the parent distribution, we have a symmetrical binomial distribution for $S$, and the 
UMPU critical region against $H_2$ is the equal-tails one. 


A formal proof of these results is given by Lehmann (1959). 


32.4 For small sample size $n$, therefore, tables of the binomial distribution are 
sufficient both to determine the size of the Sign test and to determine its power against 
any particular alternative value of $P$, and thus its power function for alternatives $H_1$ or 
$H_2$. As $n$ increases, the tendency of the binomial distribution to normality enables us 
to say that $(S - nP)/\{nP(1-P)\}^{1/2}$ has a standardized normal distribution. If we use 
a continuity correction as in 31.80 for the discreteness of $S$, this amounts to replacing 
$|S - nP|$ by $|S - nP| - \tfrac12$ in carrying out the test. 

In the case of the median, when we are testing $P = \tfrac12$, the tendency to normality 
is so rapid that special tables are hardly required at all, since we need only compare 
the value of 

$$(|S - \tfrac12 n| - \tfrac12)/(\tfrac12 n^{1/2}) \tag{32.5}$$

with the appropriate standardized normal deviate. Cochran (1937) gives exact critical 
values for $n \leq 50$ and test size $\alpha = 0.05$, Dixon and Mood (1946) for $n \leq 100$ and $\alpha = 0.25$, 
0.1, 0.05 and 0.01, and MacKinnon (1964) for $n = 1(1)1000$ and $\alpha = 0.50$, 0.10, 0.05, 
0.02, 0.01 and 0.001. 
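
In place of binomial tables, the exact tail probabilities of $S$ may be summed directly. The following is a minimal sketch in Python, not a definitive implementation: the function name `sign_test` and the sample are hypothetical, observations equal to the hypothetical value are dropped, and the two-sided probability is taken (as an assumption of this sketch) to be twice the smaller tail, which is the equal-tails rule when the median is tested.

```python
from math import comb


def sign_test(sample, x0, p=0.5):
    """Exact Sign test of the hypothesis that the p-quantile equals x0.

    Returns (S, n, P{S' >= S}, two-sided probability), where S counts the
    observations below x0 and S' is binomial (n, p) under the hypothesis.
    """
    kept = [x for x in sample if x != x0]      # observations tied with x0 are ignored
    n = len(kept)
    s = sum(1 for x in kept if x < x0)         # number of positive signs of (x0 - x)
    pmf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    upper = sum(pmf[s:])                       # P{S' >= s}
    lower = sum(pmf[: s + 1])                  # P{S' <= s}
    two_sided = min(1.0, 2 * min(upper, lower))
    return s, n, upper, two_sided


# Illustrative data: is the median zero?
sample = [1.2, 0.4, -0.3, 2.2, 0.9, 1.7, 0.1, 3.0, 0.0]
s, n, upper, two_sided = sign_test(sample, 0.0)
print(s, n, upper, two_sided)
```

Here one observation coincides with the hypothetical value and is discarded, leaving eight, of which one lies below zero.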


(*) Because of the continuity of the parent d.f., the event $x_i = x_0$ can only occur with probability 
zero. If such "ties" occur in practice, the most powerful procedure is to ignore these 
observations for the purposes of the test, as was shown by Hemelrijk (1952); cf. 31.81. 

Power of the Sign test for the median 

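32.5 The approximate power of the Sign test is also easily ascertained by use of 
the normal approximation. Neglecting the continuity correction of 32.4, since this is 
small in large samples, we see that the critical region for the one-tailed test of $P = \tfrac12$ 
against $P > \tfrac12$ is 

$$S \geq \tfrac12 n + \tfrac12 d_\alpha n^{1/2},$$

where $d_\alpha$ is the appropriate normal deviate for a test of size $\alpha$. The power function 
is therefore approximately 

$$\begin{aligned}
Q_1(P) &= \int_{\frac12 n + \frac12 d_\alpha n^{1/2}}^{\infty} \{2\pi n P(1-P)\}^{-1/2} \exp\left\{-\frac{(S - nP)^2}{2nP(1-P)}\right\} dS \\
&= \int_{\frac{n^{1/2}(\frac12 - P) + \frac12 d_\alpha}{\{P(1-P)\}^{1/2}}}^{\infty} (2\pi)^{-1/2} \exp(-\tfrac12 t^2)\,dt \\
&= G\left\{\frac{n^{1/2}(P - \tfrac12) - \tfrac12 d_\alpha}{\{P(1-P)\}^{1/2}}\right\},
\end{aligned} \tag{32.6}$$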

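where $G\{x\}$ is the normal d.f. From (32.6), it is immediate that as $n \to \infty$ the power 
$\to 1$ for any $P > \tfrac12$, so that the test is consistent. The power function of the two-sided 
"equal-tails" test with critical region 

$$|S - \tfrac12 n| \geq \tfrac12 d_{\frac12\alpha} n^{1/2}$$

is similarly seen to be 

$$Q_2(P) = G\left\{\frac{n^{1/2}(P - \tfrac12) - \tfrac12 d_{\frac12\alpha}}{\{P(1-P)\}^{1/2}}\right\} + G\left\{\frac{n^{1/2}(\tfrac12 - P) - \tfrac12 d_{\frac12\alpha}}{\{P(1-P)\}^{1/2}}\right\}, \tag{32.7}$$

which tends to 1 for any $P \neq \tfrac12$ as $n \to \infty$. This establishes the consistency of the 
two-sided test against general alternatives. 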


Dixon (1953b) tabulates the power of the two-sided Sign test for test sizes $\alpha \leq 0.05$, 
$\alpha \leq 0.01$, $n$ ranging from 5 to 100 and $P = 0.05(0.05)0.95$. MacStewart (1941) gives a 
table of the minimum sample size required to attain given power against given values of $P$. 
Gibbons (1964) examines the effect of non-normality on the power of the one-sided Sign 
test. 
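
The approximate power functions (32.6) and (32.7) are simple to evaluate. The following is a minimal sketch in Python using the standard library's `statistics.NormalDist` for the normal d.f. and its inverse; the function name `sign_test_power` and its arguments are hypothetical, and the continuity correction is neglected as in 32.5.

```python
from statistics import NormalDist

_normal = NormalDist()          # standardized normal distribution


def sign_test_power(n, P, alpha=0.05, two_sided=False):
    """Normal-approximation power of the Sign test for the median.

    One-sided (P = 1/2 against P > 1/2) follows (32.6); the two-sided
    equal-tails form follows (32.7).
    """
    scale = (P * (1 - P)) ** 0.5
    root_n = n ** 0.5
    if not two_sided:
        d = _normal.inv_cdf(1 - alpha)                  # normal deviate d_alpha
        return _normal.cdf((root_n * (P - 0.5) - 0.5 * d) / scale)
    d = _normal.inv_cdf(1 - alpha / 2)                  # deviate for size alpha/2 in each tail
    return (_normal.cdf((root_n * (P - 0.5) - 0.5 * d) / scale)
            + _normal.cdf((root_n * (0.5 - P) - 0.5 * d) / scale))


for n in (25, 100, 400):
    print(n, sign_test_power(n, 0.6), sign_test_power(n, 0.6, two_sided=True))
```

At $P = \tfrac12$ both functions return the test size, and for any other fixed $P$ the printed values rise towards 1 as $n$ increases, in line with the consistency argument above.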


The Sign test in the symmetrical case 


32.6 The power functions (32.6) and (32.7) are expressed in terms of the alternative 
hypothesis value of $P$. If we now wish to consider the efficiency of the Sign test 
in particular situations, we must particularize the distribution further. If we return 
to the original formulation (32.2) of the hypothesis, and restrict ourselves to the case 
of the median $X_{0.5} = M$, say, we wish to test 

$$H_0 : M = M_0. \tag{32.8}$$

If the parent distribution function is $F(x)$ as before and the f.f. is $f(x)$, we have for 
the value of $P$ 

$$P = F(M_0) = \int_{-\infty}^{M_0} f(x)\,dx. \tag{32.9}$$
Suppose that we are interested in the relative efficiency of the Sign test where the 
parent $F$ is known to be symmetrical, so that its mean and median $M$ coincide. We 
may test the hypothesis (32.8) in this situation using as test statistic $\bar{x}$, the sample 
mean. If $F$ has finite variance $\sigma^2$, $\bar{x}$ is asymptotically normal with mean $M$ and 
variance $\sigma^2/n$, and in large samples it is equivalent to the "Student's" statistic 

$$t = \frac{\bar{x} - M_0}{s/(n-1)^{1/2}},$$

where $s^2$ is the sample variance. For $\bar{x}$, we have 

$$\frac{\partial}{\partial M} E(\bar{x} \mid M) = 1, \qquad \operatorname{var}(\bar{x} \mid M) = \sigma^2/n,$$

so that 

$$\left\{\frac{\partial}{\partial M} E(\bar{x} \mid M)\right\}^2 \Big/ \operatorname{var}(\bar{x} \mid M) = n/\sigma^2. \tag{32.10}$$

For the Sign test statistic, we find it convenient first to measure from $M_0$ as origin, so that 

$$E(S \mid M) = nP = n \int_{-\infty}^{0} f(x)\,dx,$$

and thus 

$$\left[\frac{\partial E(S \mid M)}{\partial M}\right]_{M=0} = -n f(0).$$

Also 

$$\operatorname{var}(S \mid M_0) = nP(1-P) = \tfrac14 n.$$

Thus, transferring back to the natural origin, 

$$\left\{\left[\frac{\partial E(S \mid M)}{\partial M}\right]_{M=M_0}\right\}^2 \Big/ \operatorname{var}(S \mid M_0) = 4n\{f(M)\}^2. \tag{32.11}$$

From (32.10), (32.11) and (25.27) we find for the efficiency of the Sign test 

$$A_{S,\bar{x}} = 4\sigma^2\{f(M)\}^2, \tag{32.12}$$

a result due to Pitman. 


32.7 There is clearly no non-zero lower bound to (32.12), as there was to the 
ARE for the Wilcoxon and Fisher-Yates tests in Chapter 31, since we may have the 
median ordinate $f(M) = 0$. In the normal case, $f(M) = (2\pi\sigma^2)^{-1/2}$, so (32.12) takes 
the value $2/\pi$. Since we are here testing symmetry about $M_0$, we may use the Wilcoxon 
test, as indicated in 31.78, with ARE $3/\pi$ in the normal case and always exceeding 
0.864. There is thus little except simplicity to recommend the use of the Sign test 
as a test of symmetry about a specified median: it is more efficient to test the sample 
mean in such a situation. The Sign test is useful when we wish to test for the median 
without the symmetry assumption. 
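
The ARE $4\sigma^2\{f(M)\}^2$ of (32.12) can be evaluated for a few familiar symmetrical parents. The following is a minimal sketch in Python; the function name `sign_test_are` is hypothetical, and the variances and median ordinates entered are the standard ones for each distribution in its unit-scale form (an assumption of the sketch, not taken from the text).

```python
from math import pi, sqrt


def sign_test_are(variance, median_ordinate):
    """ARE of the Sign test relative to the sample mean: 4 * sigma^2 * f(M)^2."""
    return 4 * variance * median_ordinate ** 2


# (variance, f at the median) for unit-scale symmetrical parents
parents = {
    "normal":             (1.0, 1 / sqrt(2 * pi)),
    "logistic":           (pi ** 2 / 3, 0.25),
    "double exponential": (2.0, 0.5),
    "rectangular (0, 1)": (1 / 12, 1.0),
}

for name, (var, f_m) in parents.items():
    print(f"{name:20s} {sign_test_are(var, f_m):.3f}")
```

The normal value reproduces $2/\pi$; a parent with a high median ordinate, such as the double exponential, can make the Sign test the more efficient of the two.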


Dixon (1953b) tabulates the power efficiency of the two-sided Sign test in the normal 
case (and gives references to earlier work, notably by J. Walsh). He shows that the 
relative efficiency (i.e. the reciprocal of the ratio of sample sizes required by the Sign 
test and the "Student's" t-test to attain equal power for tests of equal size and against 
the same alternative; cf. 25.2) decreases as any one of the sample size, test size, or the 
distance of $P$ from $\tfrac12$ increases. 

Witting (1960) uses an Edgeworth expansion to order $n^{-1}$ and shows that this second-order 
approximation gives results for the ARE little different from (32.12). 


Distribution-free confidence intervals for quantiles 

32.8 The joint distribution of the order-statistics depends very directly upon the 
parent d.f. (cf., e.g., (14.1) and (14.2)) and therefore point estimation of the parent 
quantiles by order statistics is not distribution-free. Remarkably enough, however, 
pairs of order-statistics may be used to set distribution-free confidence intervals for 
any parent quantile. 

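Consider the pair of order-statistics $x_{(r)}$ and $x_{(s)}$, $r < s$, in a sample of $n$ observations 
from the continuous d.f. $F(x)$. (14.2) gives the joint distribution of $F_r = F(x_{(r)})$ 
and $F_s = F(x_{(s)})$ as 

$$dG_{r,s} = \frac{F_r^{\,r-1}(F_s - F_r)^{s-r-1}(1 - F_s)^{n-s}\,dF_r\,dF_s}{B(r, s-r)\,B(s, n-s+1)}. \tag{32.13}$$

$X_p$, the p-quantile of $F(x)$, is defined by (32.1). Now the interval $(x_{(r)}, x_{(s)})$ can only 
cover $X_p$ if $F_r \leq p \leq F_s$, and the probability of this event is simply 

$$1 - \alpha = \int_0^p \int_p^1 dG_{r,s},$$

where the first integral refers to $F_r$. This is 

$$1 - \alpha = \int_0^p \int_0^1 dG_{r,s} - \int_0^p \int_0^p dG_{r,s}, \tag{32.14}$$

and since $F_r \leq F_s$, (32.14) may be written 

$$1 - \alpha = \int_0^p \int_0^1 dG_{r,s} - \int_0^p \int_0^{F_s} dG_{r,s}, \tag{32.15}$$

where in the second double integral the integration from 0 to $p$ is over $F_s$ and that from 0 to $F_s$ is over $F_r$. 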

The double integrals on the right of (32.15) are easy to evaluate. In the first of them, 
the integration with respect to $F_s$ is over its entire range, and the integration from 
0 to $p$ is therefore on the marginal distribution of $F_r$, which by (14.1) is 

$$dG_r = \frac{F_r^{\,r-1}(1 - F_r)^{n-r}\,dF_r}{B(r, n-r+1)},$$

a Beta variate of the first kind whose d.f. is simply an Incomplete Beta Function. 
Hence 

$$\int_0^p \int_0^1 dG_{r,s} = \int_0^p dG_r = I_p(r, n-r+1). \tag{32.16}$$

In the second double integral in (32.15), we make the substitution $u = F_r/F_s$, $v = F_s$, 
with Jacobian $v$, exactly as in 11.9, and find 

$$\int_0^p \int_0^{F_s} dG_{r,s} = \int_0^p \int_0^1 \frac{(uv)^{r-1}(v - uv)^{s-r-1}(1 - v)^{n-s}}{B(r, s-r)\,B(s, n-s+1)}\,v\,du\,dv,$$

and on integrating out $u$ over its entire range 0 to 1, we are left as before with a marginal 
distribution, this time of $F_s$, to be integrated from 0 to $p$. Thus 

$$\int_0^p \int_0^{F_s} dG_{r,s} = \int_0^p dG_s = I_p(s, n-s+1). \tag{32.17}$$

Putting (32.16-17) into (32.15), we have 

$$P\{x_{(r)} \leq X_p \leq x_{(s)}\} = 1 - \alpha = I_p(r, n-r+1) - I_p(s, n-s+1). \tag{32.18}$$


32.9 We see from (32.18) that the interval $(x_{(r)}, x_{(s)})$ covers the quantile $X_p$ with 
a confidence coefficient which does not depend on $F(x)$ at all, and we thus have a 
distribution-free confidence interval for $X_p$. Since $I_p(a, b) = 1 - I_{1-p}(b, a)$, we may 
also write the confidence coefficient as 

$$1 - \alpha = I_{1-p}(n-s+1, s) - I_{1-p}(n-r+1, r). \tag{32.19}$$

By the Incomplete Beta relation with the binomial expansion given in 5.7, (32.19) may 
be expressed as 

$$1 - \alpha = \left\{\sum_{i=0}^{s-1} - \sum_{i=0}^{r-1}\right\} \binom{n}{i} p^i q^{n-i} = \sum_{i=r}^{s-1} \binom{n}{i} p^i q^{n-i}, \tag{32.20}$$

where $q = 1 - p$. The confidence coefficient is therefore the sum of the terms in the 
binomial $(q + p)^n$ from the $(r+1)$th to the $s$th inclusive. 

If we choose a pair of symmetrically placed order-statistics we have $s = n - r + 1$, 
and find in (32.18-20) 

$$1 - \alpha = I_p(r, n-r+1) - I_p(n-r+1, r)$$

$$\phantom{1 - \alpha} = 1 - \{I_{1-p}(n-r+1, r) + I_p(n-r+1, r)\}, \tag{32.21}$$

$$\phantom{1 - \alpha} = \sum_{i=r}^{n-r} \binom{n}{i} p^i q^{n-i}, \tag{32.22}$$

so that the confidence coefficient is the sum of the central $(n - 2r + 1)$ terms of the 
binomial, $r$ terms at each end being omitted. 

For any values of $r$ and $n$, the confidence coefficient attaching to the interval 
$(x_{(r)}, x_{(n-r+1)})$ may be calculated from (32.21-2), if necessary using the Tables of the 
Incomplete Beta Function. The tables of the binomial distribution listed in 5.7 may 
also be used. Exercise 32.4 gives the reader an opportunity to practise the computation. 


Scheffé and Tukey (1945) show that if the parent distribution is discrete, the confidence 
intervals above cover $X_p$ with probability $\geq 1 - \alpha$. 
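
The binomial form (32.20) of the confidence coefficient lends itself to direct computation. The following is a minimal sketch in Python; the function name `quantile_interval_coverage` is hypothetical, and the sketch assumes a continuous parent so that the coefficient is exact.

```python
from math import comb


def quantile_interval_coverage(n, r, s, p):
    """Confidence coefficient of (x_(r), x_(s)) for the p-quantile, from (32.20).

    Sums the binomial terms C(n, i) p^i q^(n-i) for i = r, ..., s-1.
    """
    q = 1 - p
    return sum(comb(n, i) * p ** i * q ** (n - i) for i in range(r, s))


# Symmetrically placed order-statistics (s = n - r + 1) for the median, n = 10
print(quantile_interval_coverage(10, 2, 9, 0.5))

# An interval for the lower quartile from the 2nd and 9th of 20 order-statistics
print(quantile_interval_coverage(20, 2, 9, 0.25))
```

The first call sums the central seven terms of the symmetrical binomial with $n = 10$, the two terms at each end being omitted as in (32.22).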


32.10 In the special case of the parent median $X_{0.5}$, (32.21-22) reduce to 

$$1 - \alpha = 1 - 2I_{0.5}(n-r+1, r) = 2^{-n} \sum_{i=r}^{n-r} \binom{n}{i}, \tag{32.23}$$

a particularly simple form. This confidence interval procedure for the median was 
first proposed by Thompson (1936). 

MacKinnon (1964) gives tables for $n = 1(1)1000$ and $\alpha$ as nearly as possible equal 
to 0.50, 0.10, 0.05, 0.02, 0.01 and 0.001. Nair (1940) gave similar tables for $n = 6(1)81$, 
stating the exact value of $\alpha$, as close as possible to 0.05 and 0.01, in each case. 
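
In the absence of such tables, the largest $r$ for which (32.23) still gives a confidence coefficient of at least $1 - \alpha$ can be found by direct search. The following is a minimal sketch in Python; the function name `median_interval` and the data are hypothetical, and the rule of taking the largest admissible $r$ (hence the shortest interval) is a choice made for this sketch.

```python
from math import comb


def median_interval(sample, alpha=0.05):
    """Distribution-free interval (x_(r), x_(n-r+1)) for the median.

    Returns (lower, upper, r, exact coefficient), using the largest r whose
    coefficient 2^-n * sum_{i=r}^{n-r} C(n, i) is at least 1 - alpha.
    """
    x = sorted(sample)
    n = len(x)
    chosen = None
    for r in range(1, n // 2 + 1):
        coefficient = sum(comb(n, i) for i in range(r, n - r + 1)) / 2 ** n
        if coefficient >= 1 - alpha:
            chosen = (r, coefficient)        # the coefficient falls as r increases
    if chosen is None:
        raise ValueError("sample too small for the requested coefficient")
    r, coefficient = chosen
    return x[r - 1], x[n - r], r, coefficient


data = [12, 7, 3, 9, 15, 4, 11, 8, 6, 10]     # illustrative sample, n = 10
print(median_interval(data, alpha=0.05))
```

Because the binomial is discrete, the coefficient attained generally exceeds the nominal $1 - \alpha$, which is why the tables cited state the exact value in each case.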


Distribution-free tolerance intervals 


32.11 In 20.37 we discussed the problem of finding tolerance intervals for a 
normal d.f. Suppose now that we require such intervals without making assumptions 
beyond continuity on the underlying distributional form. We require to calculate 
a randomly varying interval $(l, u)$ such that 

$$P\left\{\int_l^u f(x)\,dx \geq \gamma\right\} = \beta, \tag{32.24}$$

where $f(x)$ is the unknown continuous frequency function. It is not obvious that such 
a distribution-free procedure is possible, but Wilks (1941, 1942) showed that the 
order-statistics $x_{(r)}$, $x_{(s)}$ provide distribution-free tolerance intervals, and Robbins 
(1944) showed that only the order-statistics do so. 

If we write $l = x_{(r)}$, $u = x_{(s)}$ in (32.24), we may rewrite it 

$$P[\{F(x_{(s)}) - F(x_{(r)})\} \geq \gamma] = \beta. \tag{32.25}$$

We may obtain the exact distribution of the random variable $F(x_{(s)}) - F(x_{(r)})$ from 
(32.13) by the transformation $y = F(x_{(s)}) - F(x_{(r)})$, $z = F(x_{(r)})$, with Jacobian 1. 
(32.13) becomes 

$$dG_{y,z} = \frac{z^{r-1}\,y^{s-r-1}(1 - y - z)^{n-s}\,dy\,dz}{B(r, s-r)\,B(s, n-s+1)}, \qquad 0 \leq y + z \leq 1. \tag{32.26}$$

In (32.26) we integrate out $z$ over its range $(0, 1-y)$, obtaining for the marginal distribution 
of $y$ 

$$dG_y = \frac{y^{s-r-1}\,dy}{B(r, s-r)\,B(s, n-s+1)} \int_0^{1-y} z^{r-1}(1 - y - z)^{n-s}\,dz. \tag{32.27}$$

We put $z = (1-y)t$, reducing (32.27) to 

$$\begin{aligned}
dG_y &= \frac{y^{s-r-1}(1-y)^{n-s+r}\,dy}{B(r, s-r)\,B(s, n-s+1)} \int_0^1 t^{r-1}(1-t)^{n-s}\,dt \\
&= y^{s-r-1}(1-y)^{n-s+r}\,dy\,\frac{B(r, n-s+1)}{B(r, s-r)\,B(s, n-s+1)} \\
&= \frac{y^{s-r-1}(1-y)^{n-s+r}\,dy}{B(s-r, n-s+r+1)}, \qquad 0 \leq y \leq 1.
\end{aligned} \tag{32.28}$$

Thus $y = F(x_{(s)}) - F(x_{(r)})$ is distributed as a Beta variate of the first kind. If we put 
$r = 0$ in (32.28) and interpret $F(x_{(0)})$ as zero (so that $x_{(0)} = -\infty$), (32.28) reduces to 
(14.1), with $s$ written for $r$. 


32.12 From (32.28), we see that (32.25) becomes 

$$P\{y \geq \gamma\} = \int_\gamma^1 \frac{y^{s-r-1}(1-y)^{n-s+r}\,dy}{B(s-r, n-s+r+1)} = \beta, \tag{32.29}$$

which we may rewrite in terms of the Incomplete Beta Function as 

$$P\{F(x_{(s)}) - F(x_{(r)}) \geq \gamma\} = 1 - I_\gamma(s-r, n-s+r+1) = \beta. \tag{32.30}$$

The relationship (32.30) for the distribution-free tolerance interval $(x_{(r)}, x_{(s)})$ contains 
five quantities: $\gamma$ (the minimum proportion of $F(x)$ it is desired to cover), $\beta$ (the probability 
with which we desire to do this), the sample size $n$, and the order-statistics' 
positions in the sample, $r$ and $s$. Given any four of these, we can solve (32.30) for the 
fifth. In practice, $\beta$ and $\gamma$ are usually fixed at levels required by the problem, and $r$ 
and $s$ symmetrically chosen, so that $s = n - r + 1$. (32.30) then reduces to 

$$I_\gamma(n-2r+1, 2r) = 1 - \beta. \tag{32.31}$$

The left-hand side of (32.31) is a monotone decreasing function of $n$, and for any 
fixed $\beta$, $\gamma$, $r$ we can choose $n$ large enough so that (32.31) is satisfied. In practice, we 
must choose $n$ as the nearest integer above the solution of (32.31). If $r = 1$, so that 
the extreme values in the sample are being used, (32.31) reduces to 

$$I_\gamma(n-1, 2) = 1 - \beta, \tag{32.32}$$

which gives the probability $\beta$ with which the range of the sample of $n$ observations 
covers at least a proportion $\gamma$ of the parent d.f. 

The solution of (32.30) (and of its special cases (32.31-32)) has to be carried out 
numerically with the aid of the Tables of the Incomplete Beta Function, or equivalently 
(cf. 5.7) of the binomial d.f. Murphy (1948) gives graphs of $\gamma$ as a function of $n$ for 
$\beta = 0.90$, 0.95 and 0.99 and $r + (n - s + 1) = 1(1)6(2)10(5)30(10)60(20)100$; these 
are exact for $n \leq 100$, and approximate up to $n = 500$. 


Example 32.1 


We consider the numerical solution of (32.32) for $n$. It may be rewritten 

$$1 - \beta = \frac{1}{B(n-1, 2)} \int_0^\gamma y^{n-2}(1 - y)\,dy = n\gamma^{n-1} - (n-1)\gamma^n. \tag{32.33}$$

For the values of $\beta$, $\gamma$ which are required in practice (0.90 or larger, usually), $n$ is 
so large that we may write (32.33) approximately as 

$$1 - \beta = n\gamma^{n-1}(1 - \gamma),$$

$$n\gamma^{n-1} = \frac{1 - \beta}{1 - \gamma} \tag{32.34}$$

or 

$$\log n + (n-1)\log\gamma = \log\{(1-\beta)/(1-\gamma)\}. \tag{32.35}$$

The derivative of the left-hand side of (32.35) with respect to $n$ is $(1/n) + \log\gamma$ 
and for large $n$ the left-hand side of (32.35) is a monotone decreasing function of $n$. 
Thus we may guess a trial value of $n$, compare the left with the (fixed) right-hand side 
of (32.35), and increase (decrease) $n$ if the left (right) is greater. The value of $n$ satisfying 
the approximation (32.35) will be somewhat too small to satisfy the exact relationship 
(32.33), since a positive term $\gamma^n$ was dropped from the right of the latter, 
but the shortfall is modest for the values of $\beta$ and $\gamma$ used in practice. To obtain the 
exact value, we may put the solution of (32.35) into (32.33) and adjust upwards. 


Example 32.2 
We illustrate Example 32.1 with a particular computation. Let us put 
$\beta = \gamma = 0.99$. (32.35) is then 

$$\log n + (n-1)\log 0.99 = 0,$$

the right-hand side, of course, being zero whenever $\beta = \gamma$. We may use logs to base 10, 
since the adjustment to natural logs cancels through (32.35). Thus we have to solve 

$$\log_{10} n - 0.00436(n-1) = 0.$$

We first guess $n = 1000$. This makes the left-hand side negative, so we reduce $n$ to 
500, which makes it positive. We then progress iteratively as follows: 

       n     log10 n     0.00436(n-1) 
    1000     3           4.36 
     500     2.6990      2.18 
     700     2.8451      3.05 
     650     2.8129      2.83 
     600     2.7782      2.61 
     640     2.8062      2.79 
     645     2.8096      2.81 

We now put the value $n = 645$ into the exact (32.33). Its right-hand side is 

$$645(0.99)^{644} - 644(0.99)^{645} = 1.004 - 0.992 = 0.012.$$

Its left-hand side is $1 - \beta = 0.01$, so the agreement is good and we may for all practical 
purposes take $n = 645$ in order to get a 99 per cent tolerance interval for 99 per cent 
of the parent d.f. 
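
The hand iteration of this example is easily mechanized, and the exact relationship (32.33) can be searched in the same way. The following is a minimal sketch in Python; the function names `range_shortfall`, `approximate_n` and `exact_n` are hypothetical, and both searches simply step $n$ upwards by one, which is adequate for the sample sizes met in practice.

```python
from math import ceil, log


def range_shortfall(n, gamma):
    """Right-hand side of (32.33): n*gamma^(n-1) - (n-1)*gamma^n, factored to limit cancellation."""
    return gamma ** (n - 1) * (n * (1 - gamma) + gamma)


def approximate_n(beta, gamma):
    """Smallest n at which n*gamma^(n-1)*(1-gamma) has fallen to 1 - beta, cf. (32.34-35)."""
    n = max(2, ceil(-1 / log(gamma)))      # beyond this point the left of (32.35) decreases
    while n * gamma ** (n - 1) * (1 - gamma) > 1 - beta:
        n += 1
    return n


def exact_n(beta, gamma):
    """Smallest n for which the sample range covers at least gamma with probability >= beta."""
    n = 2
    while range_shortfall(n, gamma) > 1 - beta:
        n += 1
    return n


print(approximate_n(0.99, 0.99))   # search on the approximation
print(exact_n(0.99, 0.99))         # search on the exact relationship (32.33)
print(exact_n(0.95, 0.95))
```

Run as written, the first search should reproduce the value reached by the iteration above, while the exact search returns a slightly larger $n$, the difference arising from the term $\gamma^n$ neglected in the approximation.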


32.13 We have discussed only the simplest case of setting distribution-free tolerance 
intervals for a univariate continuous distribution. Extensions to multivariate tolerance 
regions, including the discontinuous case, have been made by Wald (1943b), Scheffé and 
Tukey (1945), Tukey (1947, 1948), Fraser and Wormleighton (1951), Fraser (1951, 1953), 
and Kemperman (1956). Wilks (1948) gives an exposition of the developments up to 
that date. Walsh (1962) considers symmetrical continuous distributions. Goodman 
and Madansky (1962) consider parameter-free and distribution-free tolerance limits for 
the exponential distribution. 

Scheffé and Tukey (1945) and Tukey (1948) show that if the parent distribution is 
discrete, the above tolerance intervals and regions have probability $\geq 1 - \alpha$. 


Point estimation using order-statistics 


32.14 As we remarked at the beginning of 32.8, we cannot make distribution-free 
point estimates using the order-statistics because their joint distribution depends 
heavily upon the parent d.f. $F(x)$. We are now, therefore, re-entering the field of 
parametric problems, and we ask what uses can be made of the order-statistics in estimating 
parameters. There are two essentially different contexts in which the order-statistics 
may be considered: 


(1) We may deliberately use functions of the order-statistics to estimate parameters, 
even though we know these estimating procedures are inefficient, because of 
the simplicity and rapidity of the computational procedures. (We discussed 
essentially this point in Example 17.13 in another connexion.) In 14.6-7 we 
gave some numerical values concerning the efficiencies of multiples of the 
sample median and mid-range as estimators of the mean of a normal population, 
and also of the sample interquartile range as an estimator of the normal population 
standard deviation. These three estimators are examples of easily computed 
inefficient statistics. 

(2) For some reason, not all the sample members may be available for estimation 
purposes, and we must perforce use an estimator which is a function of only 
some of them. The distinction between (1) and (2) thus essentially concerns 
the background of the problem. Formally, however, we may subsume (1) 
under (2) as the extreme case when the number of sample members not available 
is equal to zero. 


Truncation and censoring 

32.15 Before proceeding to any detail, we briefly discuss the circumstances in which 
sample members are not available. Suppose first that the underlying variate x simply 
cannot be observed in part or parts of its range. For example, if x is the distance from 
the centre of a vertical circular target of fixed radius R on a shooting range, we can only 
observe x for shots actually hitting the target. If we have no knowledge of how many 
shots were fired at the target (say, $n$) we simply have to accept the $m$ values of $x$ observed 
on the target as coming from a distribution ranging from 0 to R. We then say that 
the distribution of x is truncated on the right at R. Similarly, if we define y in this 
example as the distance of a shot from the vertical line through the centre of the target, 
y may range from —R to +R and its distribution is doubly truncated. Similarly, we 
may have a variate truncated on the left (e.g. if observations below a certain value are 
not recorded). Generally, a variate may be multiply truncated in several parts of its 
range simultaneously. A truncated variate differs in no essential way from any other 
but it is treated separately because its distribution is generated by an underlying untruncated 
variable, which may be of familiar form. Thus, in Exercise 17.27, we considered 
a Poisson distribution truncated on the left to exclude the zero frequency. 

Tukey (1949) and W. L. Smith (1957) have shown that truncation at fixed points does 
not alter any properties of sufficiency and completeness possessed by a statistic. 


32.16 On the other hand, consider our target example of 32.15 again, but now 
suppose that we know how many shots were fired at the target. We still only observe 
m values of x, all between 0 and R inclusive, but we know that n − m = r further
values of x exist, and that these will exceed R. In other words, we have observed the
first m order-statistics x_(1), ..., x_(m) in a sample of size n. The sample of x is now
said to be censored on the right at R. (Censoring is a property of the sample whereas 
truncation is a property of the distribution.) Similarly, we may have censoring on the 
left (e.g. in measuring the response to a certain stimulus, a certain minimum response 
may be necessary in order that measurement is possible at all) and double censoring, 
where the lowest r_1 and the highest r_2 of a sample of size n are not observed, only the
other m = n − (r_1 + r_2) being available for estimation purposes.

There is a further distinction to be made in censored samples. In the examples 
we have mentioned, the censoring arose because the variate-values occurred outside 
some observable range; the censoring took place at certain fixed points. This is
called Type I censoring. Type II censoring is said to occur when a fixed proportion
of the sample size n is censored at the lower and/or upper ends of the range of x. In
practice, Type II censoring often occurs when x, the variate under observation, is a
time-period (e.g., the period to failure of a piece of equipment undergoing testing) and
the experimental time available is limited. It may then be decided to stop when the
first m of the n observations are to hand. It follows that Type II censoring is usually
on the right of the variable. 

From the theoretical point of view, the prime distinction between Type I and 
Type II censoring is that in the former case m (the number of observations) is a random 
variable, while in the latter case it is fixed in advance. The theory of Type II
censoring is correspondingly simpler.
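
The distinction can be seen in a small simulation. The sketch below is illustrative only: the function names `type1_censor` and `type2_censor`, the exponential lifetimes and the cut-off values are assumptions introduced here and do not come from the text. Under Type I censoring at a fixed point the number m of observed values varies from sample to sample, whereas under Type II censoring m is fixed and the censoring point x_(m) varies.

```python
import random


def type1_censor(sample, limit):
    """Type I: keep values not exceeding a fixed point; the number kept is random."""
    observed = sorted(x for x in sample if x <= limit)
    return observed, len(sample) - len(observed)


def type2_censor(sample, m):
    """Type II: keep the m smallest order-statistics; the cut-off point is random."""
    ordered = sorted(sample)
    return ordered[:m], len(sample) - m


rng = random.Random(0)
n = 20
for trial in range(3):
    # Hypothetical life-test data: unit exponential failure times.
    lifetimes = [rng.expovariate(1.0) for _ in range(n)]
    obs1, r1 = type1_censor(lifetimes, limit=1.0)
    obs2, r2 = type2_censor(lifetimes, m=12)
    print(f"Type I: m = {len(obs1):2d} (random), cut-off 1.000 (fixed) | "
          f"Type II: m = {len(obs2)} (fixed), cut-off {obs2[-1]:.3f} (random)")
```

Both functions censor on the right only, which is the usual case for Type II censoring in life testing.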

Of course, single truncation or censoring is merely a special case of double truncation
or censoring, where one terminal of the distribution is unrestricted, while an
“ordinary” situation is, so to speak, the doubly extreme case when there is no restriction
of any kind.


32.17 There is by now an extensive literature on problems of truncation and
censoring. To give a detailed account of the subject would take too much space. We
shall therefore summarize the results in sections 32.17-22, leaving the reader who is 
interested in the subject to follow up the references. We classify estimation problems 
into three main groups. 


(A) Maximum likelihood estimators 

A solution to any of the problems may be obtained by ML estimation; the likelihood
equations are usually soluble only by iterative methods. For example, if a continuous
variate with frequency function f(x|θ) is doubly truncated at known points
a, b, with a < b, the LF if n observations are made is

    L(x \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) \Big/ \Big\{ \int_a^b f(x \mid \theta)\,dx \Big\}^{n},    (32.36)

the denominator in (32.36) arising because the truncated variate has f.f.

    f(x \mid \theta) \Big/ \int_a^b f(x \mid \theta)\,dx.

(32.36) can be maximized by the usual methods.
Consider now the same variate, doubly censored at the fixed points a, b, with r_1
small and r_2 large sample members unobserved. For this Type I censoring, the LF is

    L(x \mid \theta) \propto \Big\{ \int_{-\infty}^{a} f(x \mid \theta)\,dx \Big\}^{r_1} \prod_{i=r_1+1}^{n-r_2} f(x_{(i)} \mid \theta) \Big\{ \int_{b}^{\infty} f(x \mid \theta)\,dx \Big\}^{r_2},    (32.37)

and r_1 and r_2 are, of course, random variables.
On the other hand, if the censoring is of Type II, with r_1 and r_2 fixed, the LF is

    L(x \mid \theta) \propto \Big\{ \int_{-\infty}^{x_{(r_1+1)}} f(x \mid \theta)\,dx \Big\}^{r_1} \prod_{i=r_1+1}^{n-r_2} f(x_{(i)} \mid \theta) \Big\{ \int_{x_{(n-r_2)}}^{\infty} f(x \mid \theta)\,dx \Big\}^{r_2}.    (32.38)

(32.37) and (32.38) are of exactly the same form. They differ in that the limits of
integration are random variables in (32.38) but not in (32.37), and that r_1, r_2 are random
variables in (32.37) but not in (32.38). Given a set of observations, however, the
formal similarity permits the same methods of iteration to be used in obtaining the
ML solutions. Moreover, as n → ∞, the two types of censoring are asymptotically
equivalent.

B. R. Rao (1958a) showed under regularity conditions that censoring always results
in a loss of estimation efficiency, but that truncation need not do so. This is also true 
(Swamy (1962a)) if the observations are grouped. Cf. Exercise 32.26. 

Halperin (1952a) showed, under regularity conditions similar to those of 18.16 and 
18.26, that the ML estimators of parameters from Type II censored samples are consistent,
asymptotically normally distributed, and efficient (cf. Exercise 32.15).

Hartley (1958) gives a general method for iterative solution of likelihood equations 
for incomplete data (covering both truncation and censoring) from discrete distributions. 
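
As a hedged illustration of the iterative solution of likelihood equations for censored data, the sketch below evaluates the logarithm of a Type I doubly censored likelihood (observed density terms multiplied by each tail probability raised to the number of values censored in that tail) and maximises it by a crude shrinking grid search. The normal parent, the function names, the censoring points and the grid search itself are assumptions made for this example; the procedures in the papers cited above are different and more efficient.

```python
import math
import random


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def loglik_type1(obs, r1, r2, a, b, mu, sigma):
    """Log-likelihood for a normal parent censored at a and b (additive constant dropped)."""
    ll = sum(-math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2 for x in obs)
    # Each censored tail contributes its probability raised to the number censored.
    if r1:
        ll += r1 * math.log(max(norm_cdf((a - mu) / sigma), 1e-300))
    if r2:
        ll += r2 * math.log(max(1.0 - norm_cdf((b - mu) / sigma), 1e-300))
    return ll


def grid_mle(obs, r1, r2, a, b, rounds=5, steps=16):
    """Evaluate on a grid, then repeatedly shrink the grid around the best point."""
    m = len(obs)
    mean = sum(obs) / m
    sd = math.sqrt(sum((x - mean) ** 2 for x in obs) / m)
    mu_lo, mu_hi = mean - 2 * sd, mean + 2 * sd
    sg_lo, sg_hi = 0.2 * sd, 5.0 * sd
    for _ in range(rounds):
        dmu, dsg = (mu_hi - mu_lo) / steps, (sg_hi - sg_lo) / steps
        _, mu, sg = max(
            (loglik_type1(obs, r1, r2, a, b, mu_lo + i * dmu, sg_lo + j * dsg),
             mu_lo + i * dmu, sg_lo + j * dsg)
            for i in range(steps + 1) for j in range(steps + 1)
        )
        mu_lo, mu_hi = mu - dmu, mu + dmu
        sg_lo, sg_hi = max(sg - dsg, 1e-6), sg + dsg
    return mu, sg


rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(300)]   # true mean 0, true s.d. 1
a, b = -1.0, 1.5                                     # hypothetical fixed censoring points
obs = [x for x in sample if a <= x <= b]
r1 = sum(x < a for x in sample)                      # number censored below a
r2 = sum(x > b for x in sample)                      # number censored above b
mu_hat, sigma_hat = grid_mle(obs, r1, r2, a, b)
print(round(mu_hat, 3), round(sigma_hat, 3))         # should lie near 0 and 1
```

For Type II censoring the same function can be used with a and b replaced by the extreme retained order-statistics, which is the formal similarity noted in the text.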


(B) Minimum variance unbiassed linear estimators 


A second approach is to seek the linear function of the available order statistics 
which is unbiassed with minimum variance in estimating the parameter of interest. 
To do this, we use the method of LS applied to the ordered observations. We have 
already considered the theory when all observations are available in 19.18-21, and this 
may be applied directly to truncated situations, provided that the expectation vector 
and dispersion matrix of the order-statistics are calculated for the truncated distribution 
itself and not for the underlying distribution upon which the truncation took place. 
The practical difficulty here is that this dispersion matrix is a function of the truncation
points a, b, so that the MV unbiassed linear function will differ as a and b vary.
There has been little or no work done in this field, presumably because of this difficulty. 

When we come to censored samples, a difficulty persists for Type I censoring, since 
we do not know how many order-statistics will fall within the censoring limits (a, b).
Thus an estimator must be defined separately for every value of r_1 and r_2 and its expectation
and variance should be calculated over all possible values of r_1 and r_2 with the
appropriate probability for each combination. Again, we know of no case where this
has been done. However, for Type II censoring, the problem does not arise, since
r_1 and r_2 are fixed in advance, and we always know which (n − r_1 − r_2) order-statistics
will be available for estimation purposes. Given their expectations and dispersion
matrix, we may apply the LS theory of 19.18-21 directly. Moreover, the expectations
and the dispersion matrix of all n order-statistics need be calculated only once
for each n. For each r_1, r_2 we may then select the (n − r_1 − r_2) expectations of the
available observations and the submatrix which is their dispersion matrix.

Chernoff et al. (1967) prove general formulae for linearly combining functions of 
the order-statistics to estimate location and scale parameters fully efficiently in censored 
or uncensored samples. 

A number of authors have suggested simpler procedures to avoid the computational 
complexities of the ML and LS approaches. The most general results have been 
obtained by Blom (1958), who derived “nearly” unbiassed “nearly” efficient linear
estimators, as did Plackett (1958), who showed that the ML estimators of location and 
scale parameters are asymptotically linear, and that the MV linear unbiassed estimators 
are asymptotically normally distributed and efficient. Thus, asymptotically at least, 
the two approaches draw together. 

Gastwirth (1966) examines the relation between MV linear unbiassed estimators 
and the corresponding asymptotically most powerful rank tests.


32.18 We now briefly give an account of the results available for each of the
principal distributions which have been studied from the standpoint of truncation and 
censoring ; the numerical details are too extensive to be reproduced here. 


The normal distribution 


Swamy (1962b) shows that truncation always reduces efficiency when both mean and 
variance are estimated and (1963) usually also does so when the observations are grouped 
—cf. Grundy (1952). For single and double truncation, ML estimation has been 
discussed by Cohen (1950a, 1957) who gives graphs to aid iterative solution of ML 
equations; Cohen and Woodward (1953) give tables, while Hald (1949) and Halperin 
(1952b) give graphs, for ML estimation with single truncation. Iterative ML pro- 
cedures for singly and doubly Type II censored samples are given by Harter and Moore 
(1966a) who review earlier work. The ML estimators tend to be somewhat more 
precise, especially when censoring is strongly asymmetric, than the MV unbiassed linear 
estimators studied by Sarhan and Greenberg (1956, 1958) whose book (1962) gives 
tables of the coefficients of these estimators for all combinations of tail censoring numbers
r_1, r_2 when n = 1(1)20. The linearized ML estimators proposed by Plackett
(1958) never have efficiency less than 99.98 per cent for n = 10. Dixon (1957) shows
that for estimating the mean of the population the very simple “trimmed” estimator

    \frac{1}{n-2} \sum_{i=2}^{n-1} x_{(i)}

never has efficiency less than 99 per cent for n = 3(1)20, and presumably for n > 20
also, while the mean of the “best two” observations (i.e. those whose mean is unbiassed
with minimum variance) has efficiency falling slowly from 86.7 per cent at n = 5 to
its asymptotic value of 81 per cent. The “best two” observations are approximately
x_(0.27n) and x_(0.73n) (cf. Exercise 32.14). Similar simple estimates of the population
standard deviation σ are given by unbiassed multiples of the statistic

    \sum_{i} \{ x_{(n-i+1)} - x_{(i)} \},

the summation containing 1, 2, 3, or 4 values of i. The best statistic of this type never
has efficiency less than 96 per cent in estimating σ.

Dixon (1960) shows that if i observations are censored in each tail, the “Winsorized”
estimator of the mean

    m_w = \frac{1}{n} \Big[ \sum_{s=i+2}^{n-i-1} x_{(s)} + (i+1)\{ x_{(i+1)} + x_{(n-i)} \} \Big]

has at least 99.9 per cent efficiency compared with the MV unbiassed linear estimator,
and that for single censoring of i observations (say, to the right) the similar estimator

    m_s = \frac{1}{n+a-1} \Big[ \sum_{s=2}^{n-i-1} x_{(s)} + (i+1)\,x_{(n-i)} + a\,x_{(1)} \Big],

with a chosen to make m_s unbiassed, is at least 96 per cent efficient.
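
The two simplest estimators just described can be sketched as follows. This is a minimal illustration, assuming a complete list of values from which the censored sample is formed by sorting; the function names are hypothetical. The trimmed mean drops the smallest and largest observations, and the Winsorized mean replaces the i censored values in each tail by the nearest retained order-statistic, so that x_(i+1) and x_(n−i) each receive weight (i+1)/n.

```python
def trimmed_mean(xs):
    """Mean of the sample after dropping its smallest and largest values."""
    s = sorted(xs)
    return sum(s[1:-1]) / (len(s) - 2)


def winsorized_mean(xs, i):
    """Winsorized mean when the i lowest and i highest values are censored."""
    s = sorted(xs)
    n = len(s)
    inner = sum(s[i + 1:n - i - 1])            # x_(i+2), ..., x_(n-i-1)
    # s[i] is x_(i+1) and s[n-i-1] is x_(n-i); each stands in for i censored values.
    return (inner + (i + 1) * (s[i] + s[n - i - 1])) / n


data = [1, 2, 3, 4, 100]                       # one wild value in the upper tail
print(trimmed_mean(data), winsorized_mean(data, 1), sum(data) / len(data))
```

With i = 0 the Winsorized mean reduces to the ordinary sample mean, which gives a simple check on the indexing.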


Some general results on the efficiency of trimmed and Winsorized estimators of the 
mean, for symmetric and symmetric unimodal distributions, are given by Bickel (1965).

Walsh (1950a) shows that estimation of a percentage point of a normal distribution
by the appropriate order-statistic is very efficient (although the estimation procedure 
is actually valid for any continuous d.f.) for Type II single censoring when the great 
majority of the sample is censored. 

Saw (1959) has shown that in singly Type II censored samples, the population 
mean can be estimated with asymptotic efficiency at least 94 per cent by a properly 
weighted combination of the observation nearest the censoring point (x,) and the simple 
mean of the other observations, and the population standard deviation estimated with 
asymptotic efficiency 100 per cent by using the sum and the sum of squares of 
the other observations about x,. Saw gives tables of the appropriate weights for 
n<20. For Type I censored samples, Saw (1961) proposes simple linear estimators 
of high efficiency. 


32.19 The exponential distribution 

The distribution f(x) = exp{−(x − μ)/σ}/σ, μ ≤ x < ∞, has been studied very
fully from the standpoint of truncation and censoring, the reason being its importance 
in studies of the durability of certain products, particularly electrical and electronic 
components. A very full bibliography of this field of life testing is given by Menden- 
hall (1958), supplemented by Govindarajulu (1964). 

ML estimation of σ (with μ known) for single truncation or Type I censoring on
the right is considered by Deemer and Votaw (1955); cf. Exercise 32.16. Their
results are generalized to censored samples from mixtures of several exponential distributions
by Mendenhall and Hader (1958). For Type II censoring on the right, the
ML estimator of σ is given by Epstein and Sobel (1953), and the estimator shown to
be also the MV unbiassed linear estimator by Sarhan (1955)—cf. Exercises 32.17-18. 
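
For orientation, both right-censored ML estimators of the exponential scale have simple closed forms, sketched below with the location taken as known and equal to zero. The Type I form is the reciprocal of the estimator of Exercise 32.16; the Type II form, total observed time plus the censored units counted at the last observed failure, divided by the number of failures, is stated here from general knowledge of the Epstein and Sobel result rather than from the text. Function names and data are hypothetical.

```python
def sigma_hat_type1(observed, n_censored, x0):
    """ML estimate of the scale under Type I right-censoring at the fixed point x0."""
    return (sum(observed) + n_censored * x0) / len(observed)


def sigma_hat_type2(first_m, n):
    """ML estimate of the scale when only the first m of n order-statistics are observed."""
    s = sorted(first_m)
    m = len(s)
    # The n - m unobserved lifetimes are each known only to exceed x_(m).
    return (sum(s) + (n - m) * s[-1]) / m


print(sigma_hat_type1([1.0, 2.0, 3.0], n_censored=2, x0=4.0))
print(sigma_hat_type2([1.0, 2.0, 3.0], n=5))
```

The two functions differ only in the value at which the censored units are counted: the fixed point x0 under Type I, the random point x_(m) under Type II.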

Sarhan and Greenberg (1957) give tables, for sample sizes up to 10, of the coefficients
of the MV unbiassed linear estimators of σ alone, and of (μ, σ) jointly, for all
combinations of Type II censoring in the tails. MV unbiassed estimators based on 
one or two order-statistics are given by Harter (1961b), Sarhan et al. (1963) and Siddiqui 
(1963); and those based on 3, 4 or 5 order-statistics by Kulldorff (1963) who gives some 
general theory. See also Laurent (1963). Saleh (1966) derives estimators based on k 
order-statistics. 


32.20 The Poisson distribution 

Cohen (1954) gives ML estimators and their asymptotic variances for singly and 
doubly truncated and (Type I) censored Poisson distributions, and discusses earlier, 
less general, work on this distribution. Cohen (1960b) gives tables and a chart for 
ML estimation when zero values are truncated. Tate and Goen (1958) obtain the
MV unbiassed estimator when truncation is on the left, and, in the particular case 
when only zero values are truncated, compare it with the (biassed) ML estimator and 
a simple unbiassed estimator suggested by Plackett (1953)—cf. Exercises 32.20, 32.224. 

Cohen (1960a) discusses ML estimation of the Poisson parameter and a parameter θ,
when a proportion θ of the values “1” observed are misclassified as “0,” and the
same author (1960c) gives the ML estimation procedure when the zero values and 
(erroneously) some of the “1” values have been truncated.


32.21 Other distributions
For the Gamma distribution with three parameters

    dF = \frac{1}{\Gamma(\beta)} \exp\{-\alpha(x-\mu)\}\,\{\alpha(x-\mu)\}^{\beta-1}\,d\{\alpha(x-\mu)\},

Chapman (1956) considers truncation on the right, and proposes simplified estimators
of (α, β) with μ known and of (α, β, μ) jointly. Cohen (1950b) had considered estimation
by the method of moments in the truncated case. Raj (1953) and Den Broeder
(1955) considered censored and truncated situations, the latter paper being concerned
with the estimation of α alone with restriction in either tail of the distribution. Wilk
et al. (1962) give ML estimators of (α, β) with μ known, and of (α, β, μ) jointly, for
Type II censoring on the right.

Sarhan and Greenberg (1959) consider MV unbiassed linear estimation for rect- 
angular distributions—cf. Exercise 32.25. Downton (1966) considers the extreme- 
value distribution treated in Exercise 18.6. Govindarajulu (1966) does the same for 
the symmetrically censored double exponential f.f. Finney (1949a), Rider (1955),
Sampford (1955), Wilkinson (1961) and S. M. Shah (1961) discuss singly truncated
binomial and negative binomial distributions. S. M. Shah (1966) discusses the doubly
truncated binomial. 

Harter and Moore (1966b) consider local ML estimation for censored samples from 
the 3-parameter lognormal distribution (with unknown starting-point); the strict ML 
estimator is infinite—cf. Exercise 18.23. 


Tests of hypotheses in censored samples 


32.22 In distinction from the substantial body of work on estimation discussed 
in 32.17-21, very little work has so far been done on hypothesis-testing problems for 
truncated and censored situations. Epstein and Sobel (1953) and Epstein (1954) 
discuss tests for censored exponential distributions ; F. N. David and Johnson (1954, 
1956) give various simple tests for censored normal samples based on sample medians 
and quantiles. Halperin (1961a) gives simple confidence intervals for singly censored 
exponential and normal samples. 

Gehan (1965a, b) (cf. also Halperin (1960)) has extended the Wilcoxon test to 
censored samples. Gastwirth (1965) derives asymptotically most powerful rank tests 
for censored samples. 


Outlying observations 


32.23 In the final sections of this chapter, we shall briefly discuss a problem which, 
at some time or other, faces every practical statistician, and perhaps, indeed, most 
practical scientists. The problem is to decide whether one or more of a set of observations
has come from a different population from that generating the other observations;
it is distinguished from the ordinary two-sample problem by the fact that we
do not know in advance which of the set of observations may be from the discrepant
population; if we did, of course, we could apply two-sample techniques which we
have discussed in earlier chapters. In fact, we are concerned with whether “contamination”
has taken place.


The setting in which the problem usually arises is that of a suspected instrumental 
or recording error; the scientist examines his data in a general way, and suspects 
that some (usually only one) of the observations are too extreme (high, or low, or both) 
to be consistent with the assumption that they have all been generated by the same 
parent. What is required is some objective method of deciding whether this suspicion 
is well-founded. 

Because the scientist’s suspicion is produced by the behaviour in the tails of his 
observed distribution, the “natural” test criteria which suggest themselves are based
on the behaviour of the extreme order-statistics, and in particular on their deviation
from some measure of location for the unsuspected observations; or (especially in the
case where “high” and “low” errors are suspected) the sample range itself may be
used as a test statistic. Thus, for example, Irwin (1925) investigated the distribution
of (x_(p) − x_(p−1))/σ in samples from a normal population (see also E. S. Pearson (1926)
and Sillito (1951)), and “Student” (1927) recommended the use of the range for
testing outlying observations. 

Since these very early discussions of the problem, a good deal of work has been done 
along the same lines, practically all of which considers only the case of a normal parent. 
Now it is clear that the distribution of extreme observations is sensitive to the parental 
distributional form (cf. Chapter 14), so that these procedures are very unlikely to
be robust to departures from normality, but it is difficult in general to do other than 
take a normal parent—the same objection on grounds of non-robustness would lie for 
any other parent. 


32.24 Ferguson (1961), following Dixon (1950), sets up two general alternative
hypotheses. Model A (the location-shifts hypothesis) is that n independent normal
observations x_i have common variance σ² and means E(x_i) = μ + σΔa_{ν_i}, where the a’s
are known constants, not all equal, and (ν_1, ν_2, ..., ν_n) is an unknown permutation
of the integers 1 to n. Model B (the scale-shifts hypothesis) is that the x_i are independent
and normal with common mean μ and variances V(x_i) = σ² exp(Δa_{ν_i}). We
test H_0: Δ = 0 in either model. Considering only tests invariant under location and
scale changes, Ferguson (1961) shows that in Model A, with μ unknown, the locally most
powerful test of H_0 against the one-sided H_1: Δ > 0 is based on √b_1, the skewness coefficient
of the x’s. Large values of √b_1 are critical if k_3(a), the third k-statistic of the a’s,
is positive, and small values of √b_1 are critical if k_3(a) < 0. If (as occurs in problems
of outliers) (n − l) of the a’s are zero and l/n < ½, then k_3(a) > 0, so that large values of
√b_1 are always critical if less than half the observations are outliers. For the two-sided
H_1: Δ ≠ 0, the test based on b_2, the kurtosis coefficient of the x’s, is locally most
powerful unbiassed, large or small values being critical according as k_4(a) > 0 or < 0.
If (n − l) of the a’s are zero, k_4(a) > 0 if l/n < 0.21, so large values of b_2 are always critical
if less than 21 per cent of observations are outliers. For Model B, where only the one-sided
H_1: Δ > 0 is relevant, small scale-shifts in outlier problems are always upward,
and the locally most powerful test is based on large values of b_2 whatever the a’s may be
(so that any number of outliers is permissible here).
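
The two thresholds quoted for the proportion of outliers (one half for the third k-statistic, about 21 per cent for the fourth) can be checked numerically. The sketch below assumes, purely for illustration, that the l non-zero a’s are all equal to one; the function name is hypothetical, and the k-statistics are computed from the sample central moments by the standard formulae.

```python
def k_statistics_3_4(values):
    """Third and fourth k-statistics (unbiassed estimators of the cumulants)."""
    n = len(values)
    mean = sum(values) / n
    m2, m3, m4 = (sum((v - mean) ** p for v in values) / n for p in (2, 3, 4))
    k3 = n * n * m3 / ((n - 1) * (n - 2))
    k4 = n * n * ((n + 1) * m4 - 3 * (n - 1) * m2 * m2) / ((n - 1) * (n - 2) * (n - 3))
    return k3, k4


n = 100
for l in (10, 20, 22, 40, 60):
    a = [1.0] * l + [0.0] * (n - l)      # l outliers, n - l ordinary observations
    k3, k4 = k_statistics_3_4(a)
    print(f"l/n = {l / n:.2f}: k3 {'>' if k3 > 0 else '<'} 0, k4 {'>' if k4 > 0 else '<'} 0")
```

In the limit the sign of the fourth cumulant of such a two-point pattern changes where p(1 − p) = 1/6, that is at p ≈ 0.211, which agrees with the 21 per cent figure.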

However, “locally” most powerful means “near Δ = 0”, so evidence is still required
of the efficiency of these tests for large shifts. Moreover, these tests have a
formidable rival, since Paulson (1952) and Kudo (1956) have shown that for Model A,
with at most one outlier, the probability of correctly rejecting the outlier is maximized
if we use as our criterion the value of the studentized extreme deviate (cf. Thompson
(1935), E. S. Pearson and Chandra Sekar (1936))

    t_n = \frac{x_{(n)} - \bar{x}}{s} \quad \Big( \text{or } t_1 = \frac{\bar{x} - x_{(1)}}{s} \Big)    (32.39)

(where s² is the pooled estimate of σ² from all available observations) for one-sided
alternatives Δ > 0 or Δ < 0 respectively. The same property holds for the studentized
maximum absolute deviate

    t_{\max} = \max\{ t_1, t_n \}    (32.40)

for two-sided alternatives Δ ≠ 0 in Model A and also (Ferguson, 1961) for Δ > 0 in
Model B.
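
A minimal sketch of the studentized extreme deviates and their maximum follows. It assumes the simplest case, in which the only estimate of σ² available is the usual unbiassed sample variance of the same n observations; the function name and the data are hypothetical.

```python
import math


def studentized_extremes(xs):
    """Return (lower deviate, upper deviate, maximum absolute deviate), all divided by s."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))   # assumed estimate of sigma
    t_low = (mean - min(xs)) / s
    t_high = (max(xs) - mean) / s
    return t_low, t_high, max(t_low, t_high)


t_low, t_high, t_max = studentized_extremes([1.0, 2.0, 3.0, 4.0, 10.0])
print(round(t_low, 4), round(t_high, 4), round(t_max, 4))
```

The computed value would then be referred to the tabulated percentage points cited in the text; no critical values are built into the sketch.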

Ferguson (1961) carried out sampling experiments on Model A based on 25,000
random normal deviates successively assembled into samples of size n = 5(5)25, with
one outlier differing from the other observations by σ(σ)15σ. Of the one-sided tests,
t_n behaved slightly better than √b_1, while b_2 and t_max differed little as two-sided tests.

The Biometrika Tables give 0.95 and 0.99 quantiles of √b_1 for n ≥ 25 and of b_2 for n ≥ 200.
Ferguson (1961) estimates their quantiles by sampling experiments for n = 5(5)25.
Quesenberry and David (1961) tabulate t_n (t_1) and t_max.


32.25 Dixon (1950, 1951) considered ratios of form

    \frac{x_{(r)} - x_{(1)}}{x_{(n-s)} - x_{(1)}}

as rejection criteria and conducted sampling experiments to examine their powers. When σ is known
he found that the standardized extreme deviate

    u_n = \frac{x_{(n)} - \bar{x}}{\sigma} \quad \Big( \text{or } u_1 = \frac{\bar{x} - x_{(1)}}{\sigma} \Big)    (32.41)

and the standardized range

    \frac{x_{(n)} - x_{(1)}}{\sigma}    (32.42)

were about equally powerful. H. A. David et al. (1954) tabulate the percentiles of the
studentized range

    \frac{x_{(n)} - x_{(1)}}{s}.    (32.43)

(32.43) is also tabulated, in the simplified case where its numerator and denominator are 
derived from independent samples, in The Biometrika Tables, by Pachares (1959) and 
by Harter (1960). (32.39) is tabulated with the same simplification in The Biometrika 
Tables, by Nair (1948, 1952), H. A. David (1956), Pillai (1959) and by Pillai and Tienzo 
(1959)—see also H. A. David and Paulson (1965). (32.40) is similarly tabulated by 
Halperin et al. (1955). Anscombe (1960) investigated the effect of rejecting outliers 
on subsequent estimation, mainly when o is known. Bliss et al. (1956) gave a range 
criterion for rejecting a single outlier among k normal samples of size n, with tables. 
Other criteria are discussed by Grubbs (1950) and Dixon (1950, 1951, 1953a). 

Dixon (1962, Chapter 10H of Sarhan and Greenberg (1962)) gives an extensive review 
of the subject, including many of the tables referred to above, and some others.

An obvious way of increasing the robustness of tests and estimators to the presence 
of outliers is to base them upon the “central” part of the sample, e.g., the “trimmed”
and “Winsorized” estimators of 32.18, and the tests in 32.22.


Non-normal situations

32.26 One of the few general methods of handling the problem of outlying obser- 
vations is due to Darling (1952), who obtains an integral form for the c.f. of the dis- 
tribution of 


    z_n = \sum_{i=1}^{n} x_i \Big/ x_{(n)},    (32.44)


where the observations x; are identical independent positive-valued variates with a 
fully specified distribution. In particular cases, this c.f. may be inverted. Darling 
goes on to consider the case of χ² variates in detail. Here, we consider only the simpler
case of rectangular variates, where Darling’s result may be derived directly. 

Suppose that we have observations x_1, x_2, ..., x_n rectangularly distributed on the
interval (0, θ). Then we know from 17.46 that the largest observation x_(n) is sufficient
for θ, and from 23.12 that x_(n) is a complete sufficient statistic. By the result of Exercise
23.7, therefore, any statistic whose distribution does not depend upon θ will be
distributed independently of x_(n). Now clearly z_n as defined at (32.44) is of degree
zero in θ. Thus z_n is distributed independently of x_(n), and the conditional distribution
of z_n given x_(n) is the same as its unconditional (marginal) distribution. But, given x_(n),
any x_(i) (i < n) is uniformly distributed on the range (0, x_(n)). Thus x_(i)/x_(n), given x_(n),
is uniformly distributed on the range (0, 1) and we see from (32.44) that z_n is distributed
exactly like the sum of (n − 1) independent rectangular variates on (0, 1) plus the
constant 1 (= x_(n)/x_(n)).

Since we have seen in Example 11.9 that the sum of n independent rectangular
variates tends to normality (and is actually close to normality even for n = 3), it follows
that z_n is asymptotically normally distributed with mean and variance exactly
given by

    E(z_n) = (n-1)\tfrac{1}{2} + 1 = \tfrac{1}{2}(n+1), \qquad \operatorname{var} z_n = (n-1)/12.    (32.45)

Small values of z_n (corresponding to large values of x_(n)) form the critical region for
the hypothesis that all n observations are identically distributed against the alternative
that the largest of them comes from an “outlying” distribution.
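
The exact mean ½(n+1) and variance (n−1)/12 of the ratio of the sample sum to the largest observation can be checked by simulation. The sketch below is illustrative only: the function names, the sample size, the number of replications and the value of θ are arbitrary choices made here. Since the statistic is scale-free, the value of θ should not affect the result.

```python
import random


def darling_z(xs):
    """Ratio of the sample sum to the largest observation."""
    return sum(xs) / max(xs)


def simulate_z(n, reps, theta=1.0, seed=0):
    """Monte Carlo mean and variance of the ratio for rectangular samples on (0, theta)."""
    rng = random.Random(seed)
    zs = [darling_z([rng.uniform(0.0, theta) for _ in range(n)]) for _ in range(reps)]
    mean = sum(zs) / reps
    var = sum((z - mean) ** 2 for z in zs) / (reps - 1)
    return mean, var


n = 10
mean, var = simulate_z(n, reps=20000, theta=3.0, seed=1)
print(f"simulated mean {mean:.3f} against {(n + 1) / 2}")
print(f"simulated variance {var:.3f} against {(n - 1) / 12:.3f}")
```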


32.27 Darling’s result may be used to test an “outlier” for any fully specified
parent by first making a probability integral transformation (cf. 30.36) of the observations,
thus reducing the problem to a rectangular distribution on (0, 1). The smallest
value x_(1) may similarly be tested by taking the complement to unity of these rectangular
variates and testing the largest of the complemented values as before.
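
The procedure can be sketched as below for a parent whose distribution function is fully specified. The function name, the unit exponential parent and the data are assumptions for the example, and the significance level uses the normal approximation with mean ½(n+1) and variance (n−1)/12, which is rough for small n.

```python
import math


def pit_outlier_test(xs, cdf):
    """Transform by the known d.f., form sum/maximum, and return it with a lower-tail level."""
    u = [cdf(x) for x in xs]                  # rectangular on (0, 1) under the hypothesis
    n = len(u)
    z = sum(u) / max(u)
    mean = (n + 1) / 2
    sd = math.sqrt((n - 1) / 12)
    # Small values of z are critical, so the lower normal tail is used.
    p_lower = 0.5 * (1.0 + math.erf((z - mean) / (sd * math.sqrt(2.0))))
    return z, p_lower


def unit_exponential_cdf(x):
    return 1.0 - math.exp(-x)


sample = [0.01, 0.02, 0.03, 0.02, 0.01, 5.0]   # hypothetical data with one large value
z, p = pit_outlier_test(sample, unit_exponential_cdf)
print(round(z, 3), p < 0.01)
```

To test the smallest observation instead, the same function can be called with the complementary function `lambda x: 1.0 - unit_exponential_cdf(x)` in place of the d.f.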


A. P. Basu (1965) discusses outliers for the exponential distribution. 


32.28 We must now refer to the possibility of using distribution-free methods to 
solve the “outlier” problem without specific distributional assumptions. It is clear
that, if the extreme observations are under suspicion, this would automatically stultify 
any attempt to use an ordinary two-sample test based on rank order, such as were 
discussed in Chapter 31, for this problem. However, if we are prepared to make the 
assumption of symmetry in the (continuous) parent distribution, we are in a position 
to do something, for we may then compare the behaviour of the observations in the
suspected “tail” of the observed distribution with the behaviour in the other “tail”
which is supposed to be well behaved. Thus, for large n, we may consider the absolute 
deviations from the sample mean (or median) of the k largest and k smallest observa- 
tions, rank these 2k values, and use a symmetry test to decide whether they may be 
regarded as homogeneous. The test will be approximate, since the centre of sym- 
metry is unknown and we estimate it by the sample mean or median, but otherwise 
this is simply an application of a test of symmetry (cf. 31.78-9) to the tails of the distribution.
If n is reasonably large, and k large enough to give a reasonable choice of test
size α, the procedure should be sensitive enough for practical purposes.

Essentially similar, but more complicated, distribution-free tests of whether a group 
of 4 or more observations are to be regarded as “outliers” have been proposed by
Walsh (1950b).


32.29 Finally, we observe that if distribution-free methods of testing and estimation 
are used, they will generally be less affected by the presence of outliers than are methods 
based upon distributional assumptions, for they use order properties, rather than metric 
properties, of the observations, as we saw in Chapter 31. 


EXERCISES 


32.1 For a frequency function f(x) which takes the largest value at the median M, 
show that the ARE of the Sign test compared to “Student’s” t-test, given at (32.12),
is never less than 1/3, and attains this value when f(x) is rectangular.


(Hodges and Lehmann, 1956) 


32.2 Show that if n independent observations come from the same continuous distribution
F(x), any symmetric function of them is distributed independently of any function
of their rank order. Hence show that the Sign test and a rank correlation test may be
used in combination to test the hypothesis that F(x) has median M_0 against the alternative
that either the observations are identically distributed with median ≠ M_0, or the median
trends upwards (or downwards) for each succeeding observation. (Savage, 1957) 


32.3 Obtain the result (32.12) for the ARE of the Sign test for symmetry from the 
efficiency of the sample median relative to the sample mean in estimating the centre of a 
symmetrical distribution (cf. 25.13). 


32.4 In setting confidence intervals for the median of a continuous distribution using 
the symmetrically spaced order-statistics x_(r) and x_(n−r+1), show from (32.23) that for
n = 30, the values of r shown below give the confidence coefficients shown : 


     r    1−α         r    1−α
     8    0.995      12    0.80
     9    0.98       13    0.64
    10    0.96       14    0.42
    11    0.90       15    0.14
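
The tabulated coefficients can be reproduced numerically. The sketch assumes that the relevant result is the usual binomial one, namely that the interval between the rth smallest and rth largest observations covers the median with probability 1 − 2 Σ_{i<r} C(n, i) 2^(−n); the equation (32.23) itself is not reproduced in this section, and the function name is hypothetical.

```python
from math import comb


def median_interval_coefficient(n, r):
    """Probability that the median lies between the r-th smallest and r-th largest values."""
    one_tail = sum(comb(n, i) for i in range(r)) / 2 ** n
    return 1 - 2 * one_tail


for r in range(8, 16):
    print(r, round(median_interval_coefficient(30, r), 3))
```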


32.5 Show that in testing the hypothesis H_0: θ = θ_0 against the alternative
H_1: θ = θ_1 > θ_0, for a sample of n observations from

    dF = \tfrac{1}{2} \exp\{-|x - \theta|\}\,dx, \qquad -\infty < x < \infty,
the one-tailed Sign test is asymptotically most powerful. (Cf. Lehmann, 1959) 


32.6 Show from Example 32.1 that the range of a sample of size n = 100 from a
continuous distribution has probability exceeding 0.95 of covering at least 95 per
cent of the parent d.f., but that if we wish to cover at least 99 per cent with probability
0.95, n must be about 475 or more.
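
Both figures can be checked with a few lines of code. The sketch assumes the standard result that the proportion of a continuous parent covered by the sample range follows a Beta(n − 1, 2) distribution, so that the probability of covering at least a proportion γ is 1 − nγ^(n−1) + (n − 1)γ^n; this is presumably the content of Example 32.1, which is not reproduced here. The function names are hypothetical.

```python
def range_coverage_probability(n, gamma):
    """P{the sample range covers at least a proportion gamma of the parent}."""
    return 1 - n * gamma ** (n - 1) + (n - 1) * gamma ** n


def smallest_n(gamma, prob):
    """Smallest sample size whose range covers gamma with at least the given probability."""
    n = 2
    while range_coverage_probability(n, gamma) < prob:
        n += 1
    return n


print(round(range_coverage_probability(100, 0.95), 4))
print(smallest_n(0.99, 0.95))
```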


32.7 Show from Example 32.1 that if we wish to find a distribution-free tolerance 
interval for a proportion γ of a continuous d.f., with probability β near 1, a small increase


; ; : i 
in β from β_1 to β_2 requires an increase in sample size from n_1 to n_2 = n_1(


l=; 
approximately. 
32.8 F(x|θ) is any continuous d.f., F(a|θ) = 0, F(b|θ) = 1, and values
λ_{0n}, λ_{1n}, ..., λ_{n+1,n} are defined by F(λ_{in}|θ_0) = i/(n+1), where θ_0 is the true value of θ.
Consider a sample of n observations, and divide the sample space into (n+1)^n parts, with
probabilities P_r, r = 1, 2, ..., (n+1)^n, by the planes x_j = λ_{in}, j = 1, 2, ..., n; i = 0, 1, ...,
n+1. If t is any asymptotically unbiassed estimator of θ, and t_r its value at some arbitrarily
chosen point in the rth of the (n+1)^n parts of the sample space, show that, as
n → ∞,

    \operatorname{var} t \sim \sum_{r} (t_r - \theta)^2 P_r \ge \Big[ n^2 \sum_{i=0}^{n} \Big\{ \frac{\partial F(\lambda_{i+1,n} \mid \theta)}{\partial\theta} - \frac{\partial F(\lambda_{in} \mid \theta)}{\partial\theta} \Big\}^2 \Big]^{-1}.

(Blom, 1958)


32.9 Show that if ∂F/∂θ is a continuous function of x within the range of x and tends
to zero at its extremities, and the integral

    E\Big\{ \Big( \frac{\partial \log f}{\partial\theta} \Big)^2 \Big\} > 0

exists, where f is the f.f. ∂F(x|θ)/∂x, then the asymptotic lower bound to the variance
of an unbiassed estimator given in Exercise 32.8 reduces to the MVB (17.24).


32.10 Show that in estimating the mean of a rectangular distribution of known 
range, the result of Exercise 32.8 gives an asymptotic bound (2n²)^{-1} for the variance,
and that this is attained by the sample midrange t = ½(x_(1) + x_(n)).


32.11 Show that if ∂F/∂θ, considered as a function of x, has a denumerable number of
discontinuities, and E{(∂ log f/∂θ)²} exists, the asymptotic variance bound of Exercise 32.8
is of order n^{-2}. (Blom, 1958)


32.12 For samples of size from the logistic distribution 
dF = e-*dx/G- + e*)}*, —-o<x< O, 
show that the c.f. of the rth order-statistic x7) is 
I (r+it)l (n—r4+1 —11) 
ee tere 


n—r+1 


so that, from (16.29), 1( sin —toe is distributed in Fisher’s z form (16.26) 


r 
with », = 27, », = 2(n—r+1); and hence that the cumulants of x, are, for r > 3(n+1), 


ri 4 2 ae 
Kia De op a ee ee 
s=n—r+1 § 3 s=18 s=15 
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r—1 1 y) 4 r—1 1 n—-T f 
[2 = ane (2 at Za). 
s=n—r+18 s=15 s=18 


(Plackett, 1958; A. Birnbaum and Dudman (1963) table the means and 
standard deviations for n = 1 (1) 10 (5) 20, 50, 100; S. S. Gupta and Shah 
(1965) table the first four moments, together with selected quantiles 
of the distributions, for all the order-statistics when n = 1 (1) 10,
and quantiles only for the extreme and central order-statistics for 
n= 11 (1) 25. Tarter and Clark (1965) give formulae for the 
covariances of logistic order-statistics, which are tabulated for n<10 
by B. K. Shah (1966).) 


32.13 A continuous distribution has d.f. F(x) and $\xi_p$ is defined by $F(\xi_p) = p$. Show
by expanding x in a Taylor series about the value $E\{x_{(r)}\}$ that in samples of size n, as
$n \to \infty$ with r = [np],
$$E\{x_{(r)}\} = \xi_p + O(n^{-1})$$
and that
$$F[E\{x_{(r+1)}\}] - F[E\{x_{(r)}\}] = \frac{1}{n} + O(n^{-2}). \qquad \text{(Plackett, 1958)}$$


32.14 Show that in samples from a normal population, if we estimate the mean by
$$t = \tfrac12\,(x_{(i)} + x_{(n-i+1)}),$$
we obtain minimum variance asymptotically by choosing i = 0.2703n. Similarly, if we


estimate o by 1 
$ = 5 (x(n—i+1)— XO); 


show that we obtain minimum variance when i = 0.0692n.


(Benson, 1949; the results were first given by Karl Pearson 
in 1920—cf. also Moore (1956) ) 


32.15 Show (cf. (32.37-38)) that censoring has the effect of attaching a discrete
probability to the original parent distribution $f(x\,|\,\theta)$ in each range where censoring
takes place. Hence show that, e.g. for Type I censoring above a fixed point $x_0$, the
asymptotic variance of the ML estimator of $\theta$ is


ent t/a ("ey rea) }) 


where $P = \int_{x_0}^{\infty} f\,dx$.


(cf. Halperin (1952a) ) 


32.16 Show that for a sample of (n+m) observations from
$$dF = \theta e^{-\theta x}\,dx, \qquad 0 \le x < \infty;\ \theta > 0,$$
n of which are measured between 0 and a fixed value $x_0$, and m of which are Type I
censored, having values greater than $x_0$, the ML estimator of $\theta$ is
$$\hat\theta = n \Big/ \left\{\sum_{i=1}^{n} x_i + m x_0\right\}$$
and that
$$\operatorname{var}\hat\theta \sim \frac{\theta^2}{(n+m)\{1-\exp(-\theta x_0)\}},$$
so that the asymptotic variance of $\hat\theta$ is a monotonic decreasing function of $x_0$.
(Deemer and Votaw, 1955)
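
The estimator and large-sample variance of Exercise 32.16 can be evaluated numerically. What follows is a minimal sketch only: the function names and the data are hypothetical, and the variance is taken in the Fisher-information form $\theta^2/[(n+m)\{1-\exp(-\theta x_0)\}]$ that the model implies.

```python
import math

def censored_exponential_mle(measured, m, x0):
    """ML estimate of theta for dF = theta*exp(-theta*x)dx under Type I censoring.

    measured: the n observations actually recorded in (0, x0)
    m:        number of observations known only to exceed x0
    """
    n = len(measured)
    return n / (sum(measured) + m * x0)

def asymptotic_variance(theta, total, x0):
    """Large-sample variance theta^2 / [total * (1 - exp(-theta*x0))], total = n + m."""
    return theta ** 2 / (total * (1.0 - math.exp(-theta * x0)))

# Hypothetical data: four measured values, two observations censored at x0 = 1.5.
measured = [0.2, 0.5, 0.9, 1.4]
theta_hat = censored_exponential_mle(measured, m=2, x0=1.5)

# The variance falls as the censoring point moves to the right.
variance_near = asymptotic_variance(1.0, 10, 1.0)
variance_far = asymptotic_variance(1.0, 10, 2.0)
```

Moving $x_0$ outwards censors fewer observations on average, which is why `variance_far` is the smaller of the two values.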
32.17 In Exercise 32.16, show that if the censoring is Type II, with a fixed number
m of the sample order-statistics censored on the right, the ML estimator of
$\lambda = 1/\theta$ is
$$\hat\lambda = \frac{1}{n}\left\{\sum_{i=1}^{n} x_{(i)} + m x_{(n)}\right\}$$
and that its asymptotic variance is
$$\operatorname{var}\hat\lambda \sim \lambda^2/n. \qquad \text{(Epstein and Sobel, 1953)}$$


32.18 In Exercise 32.17, show that the variance of $\hat\lambda$ is exact for any value of n
and that $\hat\lambda$ is the MV unbiassed linear estimator of $\lambda$. (Sarhan, 1955)


32.19 Show that if a Poisson distribution with parameter $\theta$ is truncated on the right,
there exists no unbiassed estimator of $\theta$. (Tate and Goen, 1958)


32.20 If values <a are removed by truncation, show that the parameter $\theta$ of a
Poisson distribution may be estimated unbiassedly by


where n, is the frequency observed at the value r. Show that 
pati 


3S 2 Soo ee s 
var = a(e 3 A 


and that an unbiassed estimator of this variance is 


“N 1 (a+1) (a+2)na+.2 
var is t+ Se 


n 


(Subrahmaniam (1965), who shows that t has efficiency ≥ 90 per cent
compared to the ML estimator when $\theta$ < 1. If a = 0, t is Plackett’s
(1953) estimator and has efficiency > 95 per cent for all $\theta$; cf. also
J. Roy and Mitra (1957).)


32.21 For a sample of m observations, x,, show that 


n a— i 2 
DL x2 = n+ DY ——_, 
ae r=17(r+1) 
r 
where Zp = 1X r41)— 23 XG); 212, ...,8 1, 
s=1 


and hence, applying the Helmert transformation of Example 11.3 to the 7 order-statistics 
in the form 

Vr = 2r/{r(r+1)}, = L Pans aipteess Yn = n? x, 
show that in samples from a standardized normal population the joint frequency function 
of the 2; is 


n—1 
rh Qa)-H—Dexp | —} py saree+ i}, @ #8; Says oS Og = B®. 
r=1 


Defining the functions 


x 2 
G; (x) = [ex{ ~37G40 hor dt, Gi) = 1, 
show that the distribution function of u = (x,—), say Pn(u), satisfies the relationship 
Py(u) = ni (22)—-2@—-D G,_, (nu). 
(Nair (1948); McKay (1935) had obtained an equivalent result) 
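32.22 In Exercise 32.20, show that the ML estimator $\hat\theta$ is a root of
$$\hat\theta/(1-e^{-\hat\theta}) = \bar{x},$$
where $\bar{x}$ is the observed mean, and that
$$\operatorname{var}\hat\theta \sim \frac{\theta\,(1-e^{-\theta})^2}{n\,\{1-(\theta+1)\,e^{-\theta}\}}.$$
Hence show that
$$\lim_{\theta\to 0} n\operatorname{var}\hat\theta/\theta = 2, \qquad \lim_{\theta\to\infty} n\operatorname{var}\hat\theta/\theta = 1,$$
and that $n\operatorname{var}\hat\theta/\theta$ never lies outside these limits. (cf. Cohen, 1960b)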


32.23 Show that the ML estimator in Exercise 32.22 may be expressed as 


5 é co (—1) x" a’—1 : 00 yr—t aS 
Okt ee ae fm 8 a or! (e*/. 


(Irwin, 1959) 


32.24 In Exercise 32.20, show that an estimate of the number of observations including
the missing values is given by the ratio $(\sum x)^2/\sum x(x-1)$, where the summations are over
values from 1 to $\infty$. Hence show how $\theta$ may be estimated iteratively.

(Irwin, 1959; the method is due to A. G. McKendrick) 


32.25 In Type II censored samples from the rectangular distribution $dF = dx/\sigma$,
$\mu - \tfrac12\sigma \le x \le \mu + \tfrac12\sigma$, show from (32.38) that the extreme observed values $x_{(r_1+1)}$, $x_{(n-r_2)}$
are a pair of sufficient statistics for $\mu$ and $\sigma$. Hence show that the MV unbiassed linear
estimators of the parameters are
$$m = \frac{(n-2r_2-1)\,x_{(r_1+1)} + (n-2r_1-1)\,x_{(n-r_2)}}{2(n-r_1-r_2-1)},$$
$$s = \frac{n+1}{n-r_1-r_2-1}\,\bigl(x_{(n-r_2)} - x_{(r_1+1)}\bigr),$$
with variances
$$\operatorname{var} m = \frac{\sigma^2\,\{(r_1+1)(n-2r_2-1) + (n-2r_1-1)(r_2+1)\}}{4(n+1)(n+2)(n-r_1-r_2-1)},$$
$$\operatorname{var} s = \frac{\sigma^2\,(r_1+r_2+2)}{(n+2)(n-r_1-r_2-1)},$$
thus generalizing Example 19.10.

(Sarhan and Greenberg, 1959)


32.26 Referring to 18.16, show that if the parent distribution $F(x\,|\,\theta)$ is truncated
at points a below, b above, the reciprocal of the asymptotic variance of $\hat\theta$ becomes


(2 b 2 
wiv ~ olla) #/ J e}-i(Jaee) /( 74) fh 
a 
tending to the previous definition (18.30) as a—»— 0, b—> +0. Show that 


b 
R® « (0) —Rzp(0) is not necessarily positive, but that R20 (0)— ( fax) Ri O}>0 
a 


always, so that we may expect some increase in the asymptotic variance of the ML esti- 
mator unless truncation is very severe. 


(cf. B. R. Rao (1958a)) 




CHAPTER 33 
CATEGORIZED DATA 


33.1 For most of this book, we have been concerned with the analysis of measured 
observations, but in Chapter 31 we investigated the use of rank order statistics in testing 
hypotheses ; such statistics may be constructed from measured observations or, alter- 
natively, may be obtained from the ranks of the observations if no measurements are 
available or even possible. In the present chapter, we shall discuss categorized data ; 
by this we mean data which are presented in the form of frequencies falling into certain 
categories or classes. We have, of course, discussed the problems of grouping
variate-values into classes, e.g. in connexion with the calculation of moments. There,
however, the grouping was undertaken as a matter of computational convenience or
necessity, and we were in any case largely concerned with univariate situations; here, we
specifically confine ourselves to problems arising in connexion with the statistical
relationships (whether of dependence or interdependence) between two or more
“variables” expressed in categorized form. We have put quotes on the word “variables”
because it is to be interpreted in the most general sense.

A categorized “ variable ” may simply be a convenient classification of a measurable 
variable into groups, in the manner already familiar to us. On the other hand, it may 
not be expressible in terms of an underlying measurable variable at all. For example, 
we may classify men by (a) their height, (b) their hair colour, (c) their favourite film 
actor ; (a) is a categorization of a measurable variable, but (b) and (c) are not. There 
is a further distinction between (b) and (c), for hair colour itself may be expressed on 
an ordered scale, according to pigmentation, from light to dark ; this is not so for (c). 
Although, of course, one could impose various types of classifications upon the film 
actors named, the actors are intrinsically not ordered in any way. We refer to (b) as 
an ordered classification or categorization, and (c) as an unordered one. As an extreme 
case of an unordered classification, we may consider a classification which is simply a 
labelling of different samples (which we wish to compare in respect of some other 
variable). 


33.2 There is a further point to be borne in mind: on occasion, the two variables 
being investigated may simply be the same variable observed on two different occa- 
sions (e.g. before and after some event) or on two related samples (e.g. father and son, 
husband and wife, etc.). We shall refer to such a situation as one with identical cate- 
gorizations. Identical categorizations may, of course, be of any of the types (a), (b) 
or (c) in 33.1.


Association in 2 x 2 tables 
33.3 Historically, a very large part of the literature on categorized variables has 
been concerned with the problems of measurement and testing of the interdependence 


of two such variables, or, as it is generally known, the problem of association. We 
leave aside entirely the problem of estimating interdependence in the case where the
form of underlying measurable variables is known or assumed—this has been dis- 
cussed for the bivariate normal case in 26.27-33. In other words, we confine ourselves 
to non-parametric problems. 


33.4 Consider in the first place a population classified according to the presence
or absence of an attribute A. The simplest kind of problem in interdependence
arises when there are two attributes A, B, and if we denote the absence of A by α and
the absence of B by β, the numbers falling into the four possible sub-groups may,
in an obvious notation, be represented by

                 B        not-B   | TOTALS
    A           (AB)      (Aβ)    | (A)
    not-A       (αB)      (αβ)    | (α)           (33.1)
    TOTALS      (B)       (β)     | n

We shall often write this 2 × 2 table (sometimes called a fourfold table) in a form
which has already occurred at (26.58):

     a       b    | a+b
     c       d    | c+d           (33.2)
    a+c     b+d   | n


If there is no association between A and B, that is to say, if the possession of A is
irrelevant to the possession of B, there must be the same proportion of A’s among
the B’s as among the not-B’s. Thus, by definition, the attributes are independent(*)
in this set of n observations if
$$\frac{a}{a+c} = \frac{b}{b+d} = \frac{a+b}{n}. \tag{33.3}$$
It follows that
$$\frac{c}{a+c} = \frac{d}{b+d} = \frac{c+d}{n}, \qquad \frac{a}{a+b} = \frac{c}{c+d} = \frac{a+c}{n}, \qquad \frac{b}{a+b} = \frac{d}{c+d} = \frac{b+d}{n}. \tag{33.4}$$


(*) It would perhaps be better to use a neutral word like “unassociated” rather than
“independent” to describe the relationship (33.3), since it does not imply (though it is implied by)
the stochastic independence of numerical variables which may have generated the 2 × 2 table.
The distinction we are making is precisely analogous to that between lack of correlation and
independence (cf. 26.10). However, historical usage is against us here, and in any case there is
some danger of confusion between “unassociated” and “dissociated,” to be defined at the
end of 33.4. We shall therefore continue to use “independence,” as applied to categorized
variables, to mean “lack of association.”
(33.3) may be rewritten
$$a = \frac{(a+b)(a+c)}{n}. \tag{33.5}$$
If now, in any given table,
$$a > \frac{(a+b)(a+c)}{n}, \tag{33.6}$$
there are relatively more A’s among the B’s than among the not-B’s, and we shall
say that A and B are positively associated, or simply associated. Per contra, if
$$a < \frac{(a+b)(a+c)}{n}, \tag{33.7}$$
we shall say that A and B are negatively associated or dissociated.


Example 33.1 

The following table (Greenwood and Yule, 1915, Proc. Roy. Soc. Medicine, 8, 
113) shows 818 cases classified according to inoculation against cholera (attribute A) 
and freedom from attack (attribute B). 


                      Not attacked   Attacked | TOTALS
    Inoculated             276            3   |   279
    Not-inoculated         473           66   |   539
    TOTALS                 749           69   |   818

If the attributes were independent, the frequency in the inoculated-not-attacked
class would be (279 × 749)/818 = 255. The observed frequency is greater than this and
hence inoculation is positively associated with exemption from attack.
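
The arithmetic of Example 33.1 can be checked in a few lines. This is a sketch with a hypothetical helper name; the cell layout (a, b, c, d) follows table (33.2).

```python
def independence_frequency(a, b, c, d):
    """Frequency expected in cell a if the two attributes were independent."""
    n = a + b + c + d
    return (a + b) * (a + c) / n

# Cholera data: rows inoculated / not-inoculated, columns not attacked / attacked.
a, b, c, d = 276, 3, 473, 66
expected = independence_frequency(a, b, c, d)   # 279 * 749 / 818

if a > expected:
    association = "positive"
elif a < expected:
    association = "negative"
else:
    association = "none"
```

The observed 276 exceeds the independence frequency of about 255, so `association` is "positive", in agreement with the text.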


Measures of association 

33.5 If we are to sum up the strength of association between two attributes in 
a single coefficient, it is natural to require that the limits of variation of the coefficient 
should be known, and that it shall take the central value or the lowest value of its range 
when there is no association (“ independence”). We may always make location and 
scale changes in any coefficient to bring it within the range (—1, +1); independence 
should then correspond to a value of zero for the coefficient. This convention has 
the advantage of agreeing with the properties of the product-moment correlation 
coefficient (cf. 26.9). Alternatively, the range (0,1) may be preferred, zero being the 
value taken in the case of independence. 

Another obvious desideratum in a measure of association is that it should increase
as the relationship proceeds from dissociation to association. Consider the difference
between observed and “independence” frequencies in the cell corresponding to (AB),
$$D = a - \frac{(a+b)(a+c)}{n} = \frac{ad-bc}{n}. \tag{33.8}$$
For constant marginal frequencies, it is evident that the difference in any cell between
observed and “independence” frequencies is ±D and thus D determines uniquely
the departure from independence. We thus require that our coefficient should increase
with D.


33.6 Following Yule (1900, 1912) we define a coefficient of association, Q, by
the equation
$$Q = \frac{ad-bc}{ad+bc} = \frac{nD}{ad+bc}. \tag{33.9}$$
It is zero if the attributes are independent, for then D = 0. It can equal +1 only
if bc = 0, in which case there is complete association (either all A’s are B’s or all B’s
are A’s), and -1 only if ad = 0, in which case there is complete dissociation.
Furthermore, Q increases with D, for if we write e = bc/(ad), we have
$$Q = (1-e)/(1+e) = 2/(1+e) - 1,$$
so that
$$\frac{dQ}{de} = -\frac{2}{(1+e)^2},$$
and as $de/dD$ is also negative, $dQ/dD$ is positive.


Yule also proposed a so-called coefficient of colligation
$$Y = \frac{1-\left(\dfrac{bc}{ad}\right)^{1/2}}{1+\left(\dfrac{bc}{ad}\right)^{1/2}} = \frac{(ad)^{1/2}-(bc)^{1/2}}{(ad)^{1/2}+(bc)^{1/2}}, \tag{33.10}$$
but it is easy to show that
$$Q = \frac{2Y}{1+Y^2}, \tag{33.11}$$
and nothing much seems to be gained by the use of Y. It is easily seen to satisfy
our conditions.
Yet a third coefficient, to which we shall return in 33.17 below, is
$$V = \frac{ad-bc}{\{(a+b)(a+c)(b+d)(c+d)\}^{1/2}}. \tag{33.12}$$
This is evidently zero when D = 0 and increases with D. If $V^2 = 1$, we have
$$(a+b)(a+c)(b+d)(c+d) = (ad-bc)^2,$$
giving
$$4abcd + a^2(bc+bd+cd) + b^2(ac+ad+cd) + c^2(ab+ad+bd) + d^2(ac+ab+bc) = 0.$$
Since no frequency can be negative, this can only hold if at least two of a, b, c, d are
zero. If the frequencies in the same row or the same column vanish, the case is purely
nugatory. We have then only to consider a = 0, d = 0 or b = 0, c = 0. In the first
case V = -1, in the second V = +1. It cannot lie outside these limits.
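
The three coefficients of 33.6 are simple to compute. The sketch below uses hypothetical function names and applies them to the cholera data of Example 33.1, with cells (a, b, c, d) laid out as in (33.2); it also checks the relation $Q = 2Y/(1+Y^2)$ numerically.

```python
import math

def yule_q(a, b, c, d):
    """Coefficient of association: (ad - bc)/(ad + bc)."""
    return (a * d - b * c) / (a * d + b * c)

def yule_y(a, b, c, d):
    """Coefficient of colligation, built from the square roots of ad and bc."""
    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    return (root_ad - root_bc) / (root_ad + root_bc)

def coefficient_v(a, b, c, d):
    """V = (ad - bc) / sqrt of the product of the four marginal totals."""
    return (a * d - b * c) / math.sqrt((a + b) * (a + c) * (b + d) * (c + d))

table = (276, 3, 473, 66)           # cholera data of Example 33.1
q, y, v = yule_q(*table), yule_y(*table), coefficient_v(*table)
q_from_y = 2 * y / (1 + y * y)      # should reproduce q

# One empty cell is enough to send Q to +1, whereas V stays below 1.
q_one_zero = yule_q(5, 0, 3, 4)
v_one_zero = coefficient_v(5, 0, 3, 4)
```

For these data Q is much larger than V, which illustrates how differently the coefficients are scaled even though all three vanish at independence and increase with D.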
33.7 It will be observed that whereas |V| = 1 only if two frequencies in the
2 × 2 table vanish, |Q| and |Y| are unity if only one frequency vanishes. This raises
a point in connexion with the definition of complete association. We shall say that 
association is complete if all A’s are B’s, notwithstanding that all B’s are not A’s. 
If all dumb men are deaf, there is complete association between dumbness and deafness, 
however many deaf men there are who are not dumb.(*) The coefficient V is unity 
only if all A’s are B’s and all B’s are A’s, a condition which we could, if so desired, 
describe as absolute association. 

We must point out in this connexion that statistical association is different from
association in the colloquial sense. In current speech we say that A and B are associated 
if they occur together fairly often ; but in statistics they are associated only if A occurs 
relatively more frequently among the B’s than among the not-B’s. If 90 per cent 
of smokers have poor digestions, we cannot say that smoking and poor digestion are 
associated until it is shown that less than 90 per cent of non-smokers have poor digestions. 


Standard errors of the coefficients 


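33.8 We now consider the 2 × 2 table as a sample, and derive the standard errors
of the coefficients of 33.6 on the hypothesis of independence. We have, writing ∂ for
the differential to avoid confusion with d,
$$\frac{\partial e}{e} = \frac{\partial b}{b} + \frac{\partial c}{c} - \frac{\partial a}{a} - \frac{\partial d}{d},$$
whence
$$\frac{\operatorname{var} e}{e^2} = \sum \frac{\operatorname{var} u}{u^2} \pm 2\sum \frac{\operatorname{cov}(u,v)}{uv}, \tag{33.13}$$
where u and v are to be summed over a, b, c, d, the covariance terms taking the positive
sign for the pairs (a, d) and (b, c) and the negative sign otherwise. Using multinomial
results typified by
$$\operatorname{var} a = \frac{a(n-a)}{n}, \qquad \operatorname{cov}(a,b) = -\frac{ab}{n},$$
we find, on substitution in (33.13),
$$\operatorname{var} e = e^2\left(\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}\right). \tag{33.14}$$
It is then easy to derive
$$\operatorname{var} Q = \tfrac14\,(1-Q^2)^2\left(\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}\right), \tag{33.15}$$
$$\operatorname{var} Y = \left(\frac{1-Y^2}{4}\right)^2\left(\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}\right). \tag{33.16}$$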


(*) If this asymmetrical convention is followed, complete association between A and B is not, 
in general, the same as complete association between B and A. 




The sampling variance of V may be found similarly, but involves rather more lengthy
algebra. We have
$$\operatorname{var} V = \frac{1}{n}\left\{1 - V^2 + \left(V + \tfrac12 V^3\right)\frac{(a-d)^2-(b-c)^2}{\{(a+b)(a+c)(b+d)(c+d)\}^{1/2}} - \frac{3}{4}V^2\left[\frac{(a+b-c-d)^2}{(a+b)(c+d)} + \frac{(a+c-b-d)^2}{(a+c)(b+d)}\right]\right\}. \tag{33.17}$$
These formulae assume, as usual in large-sample theory, that the observed frequencies
may be used instead of their expectations in the sampling variances.
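
As an illustration of the large-sample variances of Q and Y, namely $\tfrac14(1-Q^2)^2\sum 1/u$ and $\{(1-Y^2)/4\}^2\sum 1/u$ with u running over the four cell frequencies, the following sketch evaluates them for the cholera data. The function names are hypothetical, and every cell frequency is assumed to be non-zero.

```python
import math

def yule_q(a, b, c, d):
    return (a * d - b * c) / (a * d + b * c)

def yule_y(a, b, c, d):
    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    return (root_ad - root_bc) / (root_ad + root_bc)

def reciprocal_sum(a, b, c, d):
    """1/a + 1/b + 1/c + 1/d; requires every cell to be non-zero."""
    return 1 / a + 1 / b + 1 / c + 1 / d

def var_q(a, b, c, d):
    q = yule_q(a, b, c, d)
    return 0.25 * (1 - q * q) ** 2 * reciprocal_sum(a, b, c, d)

def var_y(a, b, c, d):
    y = yule_y(a, b, c, d)
    return ((1 - y * y) / 4) ** 2 * reciprocal_sum(a, b, c, d)

table = (276, 3, 473, 66)            # cholera data of Example 33.1
se_q = math.sqrt(var_q(*table))      # standard error of Q
se_y = math.sqrt(var_y(*table))      # standard error of Y
```

Because $Q = 2Y/(1+Y^2)$, the two variances are linked by the usual first-order rule, var Q ≈ (dQ/dY)² var Y, which gives a quick consistency check on any implementation.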


Partial association 

33.9 The coefficients described above measure the interdependence of two attri- 
butes in the statistical sense, but in order to help us to decide whether such dependence 
has any causal significance it is often necessary, just as in 27.1 for correlations, to con- 
sider association in sub-populations. Suppose, for example, that a positive association 
is noticed between inoculation and freedom from attack. It is tempting to infer that 
the inoculation confers exemption, but this is not necessarily so. It might be that the 
people who are inoculated are drawn largely from the richer classes, who live in better 
hygienic conditions and are therefore better equipped to resist attack or less exposed 
to risk. In other words, the association of A and B might be due to the association 
of both with a third attribute C (wealth). We therefore consider the association of 
A and B conditional upon C being fixed. 

Associations in sub-populations are called partial associations. Analogously to
(33.6), A and B are said to be positively associated in the population of C’s if
$$(ABC) > \frac{(AC)(BC)}{(C)}, \tag{33.18}$$
where (ABC) represents the number of members bearing the attributes A, B and C;
and so on. We may also define coefficients of partial association, colligation, etc.,
such as
$$Q_{AB.C} = \frac{(ABC)(\alpha\beta C) - (A\beta C)(\alpha BC)}{(ABC)(\alpha\beta C) + (A\beta C)(\alpha BC)}, \tag{33.19}$$
which is derived from (33.9) by adding C to all the symbols representing the
frequencies.


Example 33.2 


The following example, though not in line with modern genetical thought, has a 
historical interest as showing some early attempts at discussing heredity in a quanti- 
tative way. 

Galton’s Natural Inheritance gives particulars, for 78 families containing not less
than six brothers or sisters, of eye-colour in parent and child. Denoting a light-eyed
child by A, a light-eyed parent by B and a light-eyed grandparent by C, we trace every
possible line of descent and record whether a light-eyed child has light-eyed parent
and grandparent, the number of such being denoted by (ABC), and so on. The symbol
(Aβγ), for example, denotes the number of light-eyed children whose parents and
grandparents have not-light eyes. The eight possible classes are

    (ABC) = 1928     (αBC) = 303
    (ABγ) =  596     (αBγ) = 225
    (AβC) =  552     (αβC) = 395
    (Aβγ) =  508     (αβγ) = 501.


The first question we discuss is: does there exist any association between parent and 
offspring with regard to eye-colour ? We consider both the grandparent—parent group 
(association of B’s and C’s) and the parent-child group (association of A’s and B’s). 

The proportion of light-eyed among children of light-eyed parents is (BC)/(C)
= 2231/3178 = 70.2 per cent. That of light-eyed among children of not-light-eyed
parents, (Bγ)/(γ), is 821/1830 = 44.9 per cent. Likewise (AB)/(B) = 82.7 per cent
and (Aβ)/(β) = 54.2 per cent. Evidently there is some positive association in this
set of observations between parent and offspring in regard to eye-colour.

Consider now the relationship between eye-colours of grandparents and
grand-children. The proportion of light-eyed among grand-children of light-eyed
grandparents is (AC)/(C) = 2480/3178 = 78.0 per cent. That among grand-children of
not-light-eyed grandparents, (Aγ)/(γ), is 1104/1830 = 60.3 per cent.

Thus the association between eye-colour in grandparents and grand-children is
also positive. In tabular form, the data are:

    Attributes     A      α   | TOTALS
    B            2524    528  |  3052
    β            1060    896  |  1956
    TOTALS       3584   1424  |  5008

    Attributes     C      γ   | TOTALS
    B            2231    821  |  3052
    β             947   1009  |  1956
    TOTALS       3178   1830  |  5008

    Attributes     C      γ   | TOTALS
    A            2480   1104  |  3584
    α             698    726  |  1424
    TOTALS       3178   1830  |  5008


The coefficients of association and colligation Q and Y are

                                        Q        Y
    Grandparents-parents              0.487    0.260
    Parents-children                  0.603    0.336
    Grandparents-grand-children       0.401    0.209


Now the question arises: is the resemblance between grandparent and grand-child
due merely to that between grandparent and parent, parent and child? To
investigate this, we consider the associations of grandparent and grand-child in the
sub-populations “parents light-eyed” and “parents not-light-eyed”; that is, the
associations of A and C in B and β.

Among light-eyed parents, the proportion of light-eyed amongst grand-children
of light-eyed grandparents = (ABC)/(BC) = 1928/2231 = 86.4 per cent, while the proportion
of light-eyed amongst grand-children of not-light-eyed grandparents = (ABγ)/(Bγ) = 596/821
= 72.6 per cent.
Among not-light-eyed parents, the proportion of light-eyed amongst the
grand-children of light-eyed grandparents = (AβC)/(βC) = 552/947 = 58.3 per cent, and the
proportion of light-eyed amongst the grand-children of not-light-eyed grandparents
= (Aβγ)/(βγ) = 508/1009 = 50.3 per cent.

In both cases, the partial association is well marked and positive. The association
between grandparents and grand-children cannot, then, be due wholly to the
associations between grandparents and parents, parents and children. This was interpreted
to indicate the existence of ancestral heredity, as it is called, as well as parental heredity.
The relevant tables are:

Table 33.1

    Parents light-eyed
                            Grandparents
    Grand-children       BC      Bγ   | TOTALS
        AB              1928     596  |  2524
        αB               303     225  |   528
        TOTALS          2231     821  |  3052

    Parents not-light-eyed
                            Grandparents
    Grand-children       βC      βγ   | TOTALS
        Aβ               552     508  |  1060
        αβ               395     501  |   896
        TOTALS           947    1009  |  1956

The coefficients of association and colligation are:
    $Q_{AC.B}$ = 0.412,   $Q_{AC.\beta}$ = 0.159,
    $Y_{AC.B}$ = 0.216,   $Y_{AC.\beta}$ = 0.080.
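
The coefficients quoted in Example 33.2 can be recomputed directly from the eight class frequencies. The following sketch is illustrative only: the dictionary layout and helper names are hypothetical, with attributes indexed 0 = A (child), 1 = B (parent), 2 = C (grandparent) and True meaning light-eyed.

```python
import math

# Keys are (A, B, C); True = light-eyed child / parent / grandparent.
counts = {
    (True, True, True): 1928,  (False, True, True): 303,
    (True, True, False): 596,  (False, True, False): 225,
    (True, False, True): 552,  (False, False, True): 395,
    (True, False, False): 508, (False, False, False): 501,
}

def two_by_two(i, j, fixed=None):
    """Cells (a, b, c, d) for attributes i and j, optionally within a sub-population.

    fixed maps an attribute index to the value it must take, e.g. {1: True}
    restricts the table to light-eyed parents.
    """
    fixed = fixed or {}
    cells = []
    for vi, vj in [(True, True), (True, False), (False, True), (False, False)]:
        cells.append(sum(freq for key, freq in counts.items()
                         if key[i] == vi and key[j] == vj
                         and all(key[k] == v for k, v in fixed.items())))
    return tuple(cells)

def yule_q(a, b, c, d):
    return (a * d - b * c) / (a * d + b * c)

def yule_y(a, b, c, d):
    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    return (root_ad - root_bc) / (root_ad + root_bc)

q_total = {name: yule_q(*two_by_two(i, j))
           for name, (i, j) in {"BC": (1, 2), "AB": (0, 1), "AC": (0, 2)}.items()}
q_ac_given_b = yule_q(*two_by_two(0, 2, {1: True}))        # parents light-eyed
q_ac_given_not_b = yule_q(*two_by_two(0, 2, {1: False}))   # parents not-light-eyed
y_ac_given_b = yule_y(*two_by_two(0, 2, {1: True}))
y_ac_given_not_b = yule_y(*two_by_two(0, 2, {1: False}))
```

Both partial coefficients are positive but smaller than the total coefficient for grandparents and grand-children, which is the point the example makes.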


33.10 If there are p different attributes under consideration, the number of partial
associations can become very large, even for moderate p. For example, we can choose
two in $\binom{p}{2}$ ways and consider their associations in all the possible sub-populations
of the other (p-2), which are seen to be $3^{p-2}$ in number. Thus there are $\binom{p}{2}\,3^{p-2}$
associations.
One of the principal difficulties, in fact, in discussing data subject to multiple
dichotomy (and, even more, multiple polytomy) is the sheer volume of the large number
of tables which results.
One result in this connexion is worth noticing. We have, generalizing D in
equation (33.8),
$$D_{AB.C} + D_{AB.\gamma} = \left\{(ABC) - \frac{(AC)(BC)}{(C)}\right\} + \left\{(AB\gamma) - \frac{(A\gamma)(B\gamma)}{(\gamma)}\right\}$$
$$= (AB) - \frac{(AC)(BC)}{(C)} - \frac{(A\gamma)(B\gamma)}{(\gamma)}$$
$$= D_{AB} - \frac{n}{(C)(\gamma)}\,D_{AC}\,D_{BC}. \tag{33.20}$$
If, then, A and B are independent in both (C) and (γ), $D_{AB.C} = D_{AB.\gamma} = 0$ and
(33.20) gives
$$D_{AB} = \frac{n}{(C)(\gamma)}\,D_{AC}\,D_{BC}, \tag{33.21}$$
i.e., A and B are not independent in the population as a whole unless C is independent
of A or B or both in that population. Compare (27.67) for partial correlation, where
it follows that $\rho_{12.3} = 0$ only implies $\rho_{12} = 0$ if $\rho_{13}$ or $\rho_{23} = 0$.

This result indicates that illusory associations may arise when two populations
(C) and (γ) are amalgamated, or that real associations may be masked. If A and C,
B and C, are associated, we have, from (33.20)
$$D_{AB} = D_{AB.C} + D_{AB.\gamma} + \frac{n}{(C)(\gamma)}\,D_{AC}\,D_{BC}, \tag{33.22}$$
so that if A and B are associated positively in (C) and negatively in (γ), $D_{AB}$ may be
zero, that is to say, A and B may appear independent in the whole population.


Example 33.3 

Consider the case in which some patients are treated for a disease and others not.
If A denotes recovery and B denotes treatment, suppose that the frequencies in the
2 × 2 table are:

                 B      β   | TOTALS
    A           100    200  |  300
    α            50    100  |  150
    TOTALS      150    300  |  450

Here (AB) = 100 = (A)(B)/n so that the attributes are independent. So far as can
be seen, treatment exerts no effect on recovery.
Denoting male sex by C and female sex by γ, suppose the frequencies among males
and females are:

    Males                                  Females
                 BC     βC  | TOTALS                    Bγ     βγ  | TOTALS
    AC           80    100  |  180         Aγ           20    100  |  120
    αC           40     80  |  120         αγ           10     20  |   30
    TOTALS      120    180  |  300         TOTALS       30    120  |  150

In the male group we now have
$$Q_{AB.C} = \frac{(80 \times 80) - (100 \times 40)}{(80 \times 80) + (100 \times 40)} = 0.231,$$
and in the female group
$$Q_{AB.\gamma} = -0.429.$$




Thus treatment was positively associated with recovery among the males and nega- 
tively associated with it among the females. The apparent independence in the 
combined table is due to the cancelling of these associations. 

More paradoxically, two tables may have associations of the same sign and yet,
when merged, form a table with association of the opposite sign; the reader may like to
experiment numerically to produce such a result.
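
The cancelling of partial associations in Example 33.3 can be reproduced numerically, together with the decomposition of $D_{AB}$ into the two partial departures plus the cross term $n D_{AC} D_{BC}/\{(C)(\gamma)\}$. This is a sketch with hypothetical helper names; cells are ordered (a, b, c, d) as in (33.2), and the male and female frequencies are those whose sums give the pooled table of the example.

```python
def yule_q(a, b, c, d):
    return (a * d - b * c) / (a * d + b * c)

def departure(a, b, c, d):
    """D = (ad - bc)/n for a 2 x 2 table laid out as (a, b, c, d)."""
    return (a * d - b * c) / (a + b + c + d)

# Cells are (recovered & treated, recovered & untreated,
#            not recovered & treated, not recovered & untreated).
males = (80, 100, 40, 80)
females = (20, 100, 10, 20)
pooled = tuple(m + f for m, f in zip(males, females))   # (100, 200, 50, 100)

q_males, q_females, q_pooled = (yule_q(*t) for t in (males, females, pooled))

# 2 x 2 tables of recovery by sex and of treatment by sex.
recovery_by_sex = (males[0] + males[1], females[0] + females[1],
                   males[2] + males[3], females[2] + females[3])
treatment_by_sex = (males[0] + males[2], females[0] + females[2],
                    males[1] + males[3], females[1] + females[3])

n, n_males, n_females = sum(pooled), sum(males), sum(females)
cross_term = (n * departure(*recovery_by_sex) * departure(*treatment_by_sex)
              / (n_males * n_females))
residual = departure(*pooled) - (departure(*males) + departure(*females) + cross_term)
```

Here the partial departures are +8 and -4 and the cross term is -4, so the pooled departure is exactly zero even though neither sub-population shows independence.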


Probabilistic interpretations of measures of association 

33.11 The measures of association discussed in 33.5-10 were developed from
1900 onwards, and set out to summarize the strength of association in a single compre- 
hensive coefficient. But, just as we saw in 26.10 for the correlation coefficient, it is 
not always reasonable to suppose that any single coefficient can do this adequately. 
Goodman and Kruskal (1954, 1959), in addition to giving a detailed discussion of the 
history of measures of association and a very full bibliography, make a powerful plea 
for choosing a measure which is interpretable for the purpose in hand. Most of their 
discussion is couched in terms of polytomized tables, i.e. tables with two or more cate- 
gories in the row and column classifications, and we shall refer to their work in 33.35, 
33.40 below when considering polytomies. Here, however, we remark that in a 
2 × 2 table the coefficients Q and Y defined at (33.10-11) can be given operational


interpretations, under certain conditions. 


33.12 Consider the selection at random of two individuals from a population of
n individuals classified into a table (33.1), so that each individual will fall into one of
the four categories in the body of the table. Let us score, for i = 1, 2,
$$a_i = \begin{cases} +1 & \text{if the ith individual possesses A,} \\ 0 & \text{otherwise,} \end{cases} \qquad b_i = \begin{cases} +1 & \text{if the ith individual possesses B,} \\ 0 & \text{otherwise,} \end{cases}$$
and $a_0 = a_1 - a_2$, $b_0 = b_1 - b_2$.
Define the probabilities
$$\pi_s = P\{a_0 b_0 = 1\}, \qquad \pi_d = P\{a_0 b_0 = -1\}, \qquad \pi_t = P\{a_0 b_0 = 0\}.$$
Then the coefficient
$$\gamma = \frac{\pi_s - \pi_d}{1 - \pi_t} \tag{33.23}$$
is the probability $\pi_s/(1-\pi_t)$ that the two individuals selected from the population have
their A- and B-categories different and in the same order (if we sense the table (33.1)
from left to right and top to bottom) minus the probability $\pi_d/(1-\pi_t)$ that they have
different A- and B-categories in opposite orders. Clearly, using the notation (33.2),
$$\gamma = (ad-bc)/(ad+bc) = Q$$
as defined at (33.9). Q therefore has a direct probabilistic interpretation as above.
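
The identity of this coefficient with Q can be confirmed by brute-force enumeration of pairs. The sketch below uses a hypothetical function name and assumes, for simplicity, that the two individuals are drawn with replacement; a pair in which the same individual is drawn twice is then a tie and does not affect the ratio.

```python
from itertools import product

def gamma_by_enumeration(a, b, c, d):
    """(pi_s - pi_d)/(1 - pi_t), found by enumerating ordered pairs of cells."""
    # Each cell carries (A-score, B-score, frequency) in the layout of (33.2).
    cells = [(1, 1, a), (1, 0, b), (0, 1, c), (0, 0, d)]
    n = a + b + c + d
    same_order = opposite_order = 0
    for (a1, b1, f1), (a2, b2, f2) in product(cells, repeat=2):
        score = (a1 - a2) * (b1 - b2)
        if score == 1:
            same_order += f1 * f2
        elif score == -1:
            opposite_order += f1 * f2
    pi_s = same_order / n ** 2
    pi_d = opposite_order / n ** 2
    pi_t = 1 - pi_s - pi_d          # ties: the pair agrees on A or on B
    return (pi_s - pi_d) / (1 - pi_t)

table = (276, 3, 473, 66)           # cholera data of Example 33.1
gamma = gamma_by_enumeration(*table)
q = (table[0] * table[3] - table[1] * table[2]) / (table[0] * table[3] + table[1] * table[2])
```

The enumeration gives $\pi_s = 2ad/n^2$ and $\pi_d = 2bc/n^2$, from which the equality with Q follows at once.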


33.13 Similarly, consider choosing a single individual at random from the
population, and suppose that we are asked (without prior knowledge) to guess whether
it is A or not-A. The best estimate we can make is to guess that the individual
comes from the larger of the frequencies (A), (α) in (33.1); the probability that this
estimate is correct is, from (33.2),
$$P_{m.} = \frac{1}{n}\max(a+b,\ c+d).$$
Similarly, if we have to guess whether the individual is B or not-B, we guess the larger
of (B), (β), with probability of success
$$P_{.m} = \frac{1}{n}\max(a+c,\ b+d).$$
Thus if we are asked to guess the A-category half of the time and the B-category the
other half of the time, the probability of success is $\tfrac12(P_{m.} + P_{.m})$ and the probability of
error is
$$\pi_1 = 1 - \tfrac12\,(P_{m.} + P_{.m}). \tag{33.24}$$
Suppose now that we know the individual’s B-category and are asked to guess the
A-category. The best guess is now the larger category in the appropriate column of
the table, with probability of success equal to
$$\frac{\max(a,c)}{a+c} \quad\text{or}\quad \frac{\max(b,d)}{b+d}$$
in the respective columns. Since these columns will occur in random sampling with
probabilities (a+c)/n, (b+d)/n respectively, the overall probability of success in guessing
the A-category given the B-category is
$$\frac{1}{n}\{\max(a,c) + \max(b,d)\}. \tag{33.25}$$
The overall probability of success in guessing the B-category given the A-category
will similarly be
$$\frac{1}{n}\{\max(a,b) + \max(c,d)\}. \tag{33.26}$$
If, as in the no-information situation above, we had to guess the categories alternately,
the probability of success would be the mean of (33.25) and (33.26) and the probability
of error therefore would be
$$\pi_2 = 1 - \frac{1}{2n}\{\max(a,b) + \max(a,c) + \max(b,d) + \max(c,d)\}. \tag{33.27}$$


33.14 We now define the coefficient, from (33.24) and (33.27),
$$\lambda = \frac{\pi_1 - \pi_2}{\pi_1}, \tag{33.28}$$
which is the relative reduction in error-probability produced by knowledge of one
category in predicting the other. Clearly, $\pi_1 \ge \pi_2$, so we have
$$0 \le \lambda \le 1. \tag{33.29}$$
Now, Yule (1912) suggested that the value of a “reasonable” measure of association
should not be affected if each row and each column of the 2 × 2 table is separately
multiplied through by an arbitrary positive constant. Using this invariance principle,
we multiply the table (33.2) by the constants:
    First row:  $(cd)^{1/2}/n$,     First column:  $(bd)^{1/2}/n$,
    Second row: $(ab)^{1/2}/n$,     Second column: $(ac)^{1/2}/n$,
and transform it (neglecting constants common to all four frequencies) to

    $(ad)^{1/2}$    $(bc)^{1/2}$  | $\tfrac12 m$
    $(bc)^{1/2}$    $(ad)^{1/2}$  | $\tfrac12 m$        (33.30)
    $\tfrac12 m$      $\tfrac12 m$    | $m$

Further obvious multiplications show that any coefficient of association satisfying
the invariance principle must be a function of the “cross-ratio” ad/bc alone. Q and
Y of (33.9-10) obviously satisfy this condition, but V of (33.12) does not. Edwards
(1963) derives this condition from other propositions.

Yule’s invariance principle enables us to relate λ at (33.28) to Y. For the moment,
suppose ad > bc. From (33.28), we then have for the transformed table (33.30),
$$\lambda = \frac{\tfrac12 - \left\{1 - \dfrac{2}{m}(ad)^{1/2}\right\}}{\tfrac12} = \frac{2(ad)^{1/2} - \tfrac12 m}{\tfrac12 m},$$
which since $(ad)^{1/2} + (bc)^{1/2} = \tfrac12 m$ becomes
$$\lambda = \frac{(ad)^{1/2} - (bc)^{1/2}}{(ad)^{1/2} + (bc)^{1/2}}. \tag{33.31}$$
(33.31) is identical with the definition of Y at (33.10), but we have chosen its sign
arbitrarily by taking ad > bc. Thus, generally,
$$\lambda = |Y|, \tag{33.32}$$
conferring a probabilistic interpretation upon the magnitude of Y.
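
The relation between λ and Y can be verified numerically by standardizing a table as in (33.30). The following is a sketch with hypothetical function names; it takes the error probabilities of the no-information and one-category-known guessing schemes to be $1 - \tfrac12(P_{m.}+P_{.m})$ and $1 - \{\max(a,b)+\max(a,c)+\max(b,d)+\max(c,d)\}/(2n)$ respectively.

```python
import math

def lambda_coefficient(a, b, c, d):
    """Relative reduction in error-probability from knowing one category."""
    n = a + b + c + d
    error_without = 1 - 0.5 * (max(a + b, c + d) + max(a + c, b + d)) / n
    error_with = 1 - (max(a, b) + max(a, c) + max(b, d) + max(c, d)) / (2 * n)
    return (error_without - error_with) / error_without

def standardize(a, b, c, d):
    """Rescale rows and columns so that all four marginal totals are equal."""
    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    return (root_ad, root_bc, root_bc, root_ad)

def yule_y(a, b, c, d):
    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    return (root_ad - root_bc) / (root_ad + root_bc)

table = (276, 3, 473, 66)                        # cholera data of Example 33.1
lambda_raw = lambda_coefficient(*table)          # depends on the margins
lambda_std = lambda_coefficient(*standardize(*table))
y = yule_y(*table)
```

On the raw cholera table λ is zero, because the same guess is best whichever category is known, whereas on the standardized table it equals |Y|. This shows that λ itself is not invariant under Yule’s principle; only its value on the standardized table is.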


Large-sample tests of independence in a 2 x 2 table 

33.15 We now consider the observed frequencies in a 2 x 2 table to be a sample,
and we suppose that in the parent population the true probabilities corresponding to
the frequencies a, b, c, d are $p_{11}, p_{12}, p_{21}, p_{22}$ respectively. We write the probabilities
$$
\begin{array}{cc|c}
p_{11} & p_{12} & p_{1.} \\
p_{21} & p_{22} & p_{2.} \\
\hline
p_{.1} & p_{.2} & 1
\end{array}
\tag{33.33}
$$
with $p_{1.} = p_{11}+p_{12}$, and so forth. We suppose the observations drawn with replacement
from the population (or, equivalently, that the parent population is infinite).
We also rewrite the table (33.2) in the notationally symmetrical form
$$
\begin{array}{cc|c}
n_{11} & n_{12} & n_{1.} \\
n_{21} & n_{22} & n_{2.} \\
\hline
n_{.1} & n_{.2} & n
\end{array}
$$
The distribution of the sample frequencies is given by the multinomial whose
general term is
$$
L = \frac{n!}{n_{11}!\,n_{12}!\,n_{21}!\,n_{22}!}\; p_{11}^{n_{11}}\, p_{12}^{n_{12}}\, p_{21}^{n_{21}}\, p_{22}^{n_{22}}. \tag{33.34}
$$
To estimate the $p_{ij}$, we find the Maximum Likelihood solutions for variations in
the $p_{ij}$ subject to $\sum p_{ij} = 1$. If $\lambda$ is a Lagrange multiplier, this leads to
$$
\frac{n_{11}}{p_{11}} - \lambda = 0 \quad\text{or}\quad n_{11} = \lambda p_{11}
$$
and three similar equations. Summing these, we find $\lambda = n$ and the proportions $p_{ij}$
are simply estimated by
$$
\hat p_{11} = n_{11}/n \tag{33.35}
$$
and three similar equations. This is as we should expect. The estimators are unbiassed.
We know, and have already used the fact in 33.8, that the variances of the $n_{ij}$ are
typified by
$$
\operatorname{var} n_{11} = n p_{11}(1-p_{11})
$$
and the covariances by
$$
\operatorname{cov}(n_{11}, n_{12}) = -n p_{11} p_{12}.
$$
These are exact results, and we also know (cf. Example 15.3) that in the limit the joint
distribution of the $n_{ij}$ tends to the multinormal with these variances and covariances.
We may now also observe that the asymptotic multinormality follows from the fact
that these are ML estimators and satisfy the conditions of 18.26.


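33.16 Now suppose we wish to test the hypothesis of independence in the 2 x 2
table, which is
$$
H_0 : p_{11} p_{22} = p_{12} p_{21}. \tag{33.36}
$$
This hypothesis is, of course, composite, imposing one constraint, and having two
degrees of freedom. We allow $p_{11}$ and $p_{12}$ to vary and express $p_{21}$ and $p_{22}$ by
$$
p_{21} = \frac{p_{11}(1-p_{11}-p_{12})}{p_{11}+p_{12}}, \qquad
p_{22} = \frac{p_{12}(1-p_{11}-p_{12})}{p_{11}+p_{12}}. \tag{33.37}
$$
The logarithm of the Likelihood Function is therefore, neglecting constants,
$$
\begin{aligned}
\log L &= n_{11}\log p_{11} + n_{12}\log p_{12} + n_{21}\log p_{21} + n_{22}\log p_{22} \\
&= n_{11}\log p_{11} + n_{12}\log p_{12} + n_{21}\{\log p_{11} + \log(1-p_{11}-p_{12}) - \log(p_{11}+p_{12})\} \\
&\qquad + n_{22}\{\log p_{12} + \log(1-p_{11}-p_{12}) - \log(p_{11}+p_{12})\} \\
&= n_{.1}\log p_{11} + n_{.2}\log p_{12} + n_{2.}\{\log(1-p_{1.}) - \log p_{1.}\}.
\end{aligned}
$$
To estimate the parameters, we put
$$
0 = \frac{\partial \log L}{\partial p_{11}} = \frac{n_{.1}}{p_{11}} - n_{2.}\left\{\frac{1}{1-p_{1.}} + \frac{1}{p_{1.}}\right\}
= \frac{n_{.1}}{p_{11}} - \frac{n_{2.}}{p_{1.}(1-p_{1.})}, \tag{33.38}
$$
$$
0 = \frac{\partial \log L}{\partial p_{12}} = \frac{n_{.2}}{p_{12}} - \frac{n_{2.}}{p_{1.}(1-p_{1.})}, \tag{33.39}
$$
giving for the ML estimators under $H_0$
$$
\hat p_{11} = \frac{n_{1.}\, n_{.1}}{n^2}, \qquad \hat p_{12} = \frac{n_{1.}\, n_{.2}}{n^2}. \tag{33.40}
$$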


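(33.37) gives analogous expressions for $\hat p_{21}$ and $\hat p_{22}$. Thus we estimate the cell
probabilities from the products of the proportional marginal frequencies. This justifies
the definition of association by comparison with those products in 33.4-5.
Substituting these ML estimators into the LF, we have
$$
L(n_{ij} \mid H_0, \hat p_{ij}) \propto (n_{1.} n_{.1})^{n_{11}} (n_{1.} n_{.2})^{n_{12}} (n_{2.} n_{.1})^{n_{21}} (n_{2.} n_{.2})^{n_{22}} / n^{2n}, \tag{33.41}
$$
while the unconditional maximum of the LF is obtained by inserting the estimators
(33.35) to obtain
$$
L(n_{ij} \mid \hat p_{ij}) \propto n_{11}^{n_{11}}\, n_{12}^{n_{12}}\, n_{21}^{n_{21}}\, n_{22}^{n_{22}} / n^{n}. \tag{33.42}
$$
(33.41-2) give for the LR test statistic
$$
l = \left(\frac{n_{1.} n_{.1}}{n\, n_{11}}\right)^{n_{11}} \left(\frac{n_{1.} n_{.2}}{n\, n_{12}}\right)^{n_{12}} \left(\frac{n_{2.} n_{.1}}{n\, n_{21}}\right)^{n_{21}} \left(\frac{n_{2.} n_{.2}}{n\, n_{22}}\right)^{n_{22}}. \tag{33.43}
$$
Writing $n \hat p_{ij} = n_{i.} n_{.j}/n = e_{ij}$, this becomes
$$
l = \left(\frac{e_{11}}{n_{11}}\right)^{n_{11}} \left(\frac{e_{12}}{n_{12}}\right)^{n_{12}} \left(\frac{e_{21}}{n_{21}}\right)^{n_{21}} \left(\frac{e_{22}}{n_{22}}\right)^{n_{22}}. \tag{33.44}
$$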
33.17 The general result of 24.7 now shows that $-2\log l$ is asymptotically distributed
as $\chi^2$ with one degree of freedom. This is easily seen as follows. Writing
$D_{ij} = n_{ij} - e_{ij}$ (cf. (33.8)), and expanding as far as $D^2$ ($= D_{ij}^2$, all i, j), we have
$$
\begin{aligned}
-2\log l &= 2\sum_{i=1}^{2}\sum_{j=1}^{2} (e_{ij} + D_{ij}) \log\left(1 + \frac{D_{ij}}{e_{ij}}\right) \\
&= 2\sum_{i=1}^{2}\sum_{j=1}^{2} (e_{ij} + D_{ij}) \left(\frac{D_{ij}}{e_{ij}} - \frac{D_{ij}^2}{2e_{ij}^2} + \cdots\right) \\
&= D^2 \sum_{i,j} \frac{1}{e_{ij}},
\end{aligned}
\tag{33.45}
$$
the terms linear in the $D_{ij}$ vanishing because $\sum_{i,j} D_{ij} = 0$.
(33.45) may be rewritten
$$
-2\log l = \sum_{i,j} \frac{(n_{ij} - e_{ij})^2}{e_{ij}} = X^2. \tag{33.46}
$$
We have thus demonstrated in a particular case the asymptotic equivalence of the LR
and $X^2$ goodness-of-fit tests which we remarked in 30.5. (33.46) could have been
derived directly by observing that the composite $H_0$ implies a set of hypothetical
frequencies $e_{ij}$, and that the test of independence amounts to testing the goodness-of-fit
of the observations to these hypothetical frequencies. As in 30.10, the number of
degrees of freedom is the number of classes (4) minus 1 minus the number of parameters
estimated (2), i.e. one.

It is a simple matter to show that the $X^2$ statistic at (33.46) is identically equal to
$nV^2$, where V is the measure of association defined at (33.12). We leave this to the
reader.
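
The two identities of this section can be illustrated numerically. The following is a minimal sketch, not part of the original derivation: the frequencies are hypothetical, and V is taken in the form implied by (33.58), $V^2 = (ad-bc)^2/\{(a+b)(c+d)(a+c)(b+d)\}$.

```python
import math

def expected(table):
    """Independence frequencies e_ij = n_i. n_.j / n for a 2 x 2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    return [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]

def pearson_x2(table):
    """X^2 of (33.46): sum of (n_ij - e_ij)^2 / e_ij."""
    e = expected(table)
    return sum((table[i][j] - e[i][j]) ** 2 / e[i][j]
               for i in range(2) for j in range(2))

def minus_two_log_l(table):
    """-2 log l from (33.44): 2 * sum of n_ij log(n_ij / e_ij)."""
    e = expected(table)
    return 2.0 * sum(table[i][j] * math.log(table[i][j] / e[i][j])
                     for i in range(2) for j in range(2) if table[i][j] > 0)

def n_v_squared(table):
    """n V^2, with V^2 = (ad - bc)^2 / product of the four margins."""
    (a, b), (c, d) = table
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

table = [[40, 60], [55, 45]]        # hypothetical frequencies
x2 = pearson_x2(table)
lr = minus_two_log_l(table)
nv2 = n_v_squared(table)
print(round(x2, 4), round(nv2, 4), round(lr, 4))
```

For these frequencies $X^2$ and $nV^2$ agree to rounding error, as the identity requires, while $-2\log l$ differs from them only slightly, in line with the asymptotic equivalence.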


Exact test of independence: models for the 2 x 2 table 
33.18 The tests of independence derived in 33.15-17 are asymptotic in n, the
sample size. Before we can devise exact tests of independence in 2 x 2 tables, we
must consider some distinctions first made by Barnard (1947a, b) and E. S. Pearson (1947).

It will be recalled that the expected values in the cells of the 2 x 2 table on the
hypothesis of independence of the two categorized variables are
$$
e_{ij} = n_{i.}\, n_{.j}/n, \qquad i, j = 1, 2, \tag{33.47}
$$
depending only on the four marginal frequencies and upon the sample size, n. Since
we are now concerned with exact arguments, we must explicitly take account of the
manner in which the table was formed, and in particular of the manner in which the
marginal frequencies arose. Even with n fixed, we still have three distinct possibilities
in respect of the marginal frequencies. Both sets of marginal frequencies may be
random variables, as in the case where a sample of size n is taken from a bivariate
distribution and subsequently classified into a double dichotomy. Alternatively, one set
of marginal frequencies may be fixed, because that classification is merely a labelling 
of two samples (say, Men and Women) which are to be compared in respect of the other 
classification (say, numbers infected and not-infected by a particular disease). If the 
numbers in the two samples are fixed in advance (e.g. if it is decided to examine fixed 
numbers of Men and of Women for the disease), we have one fixed set of marginal 
frequencies and one set variable. When we are thus comparing two (or more) samples 
in respect of a characteristic, we often refer to it as a test of homogeneity in two (or k) 
samples. 

Finally, we have the third possibility, in which both sets of marginal frequencies
are fixed in advance. This is much rarer in practice than the other two cases, and the
reader may like to try to construct a situation to which this applies before reading on.
The classical example of such a situation (cf. Fisher (1935a)) concerns a psycho-physical
experiment: a human subject is tested n times to verify his power of recognition of
two objects (e.g. the taste of butter and of margarine). Each object is presented a
certain number of times (not necessarily the same number for the two objects) and the
subject is informed of these numbers. The subject, if rational, then makes the marginal
frequencies of his assertions ("butter" or "margarine") coincide with the known
frequency with which they have been presented to him.


Example 33.4 


To make the distinction of 33.18 clearer, let us discuss some actual examples. The
table in Example 33.1 above is certainly not of our last type, with both sets of marginal
frequencies fixed, but it is not clear, without further information, which of the other
types it belongs to. Possibly 818 persons were examined and then classified into the
2 x 2 table. Alternatively, two samples of 279 inoculated and 539 not-inoculated persons
were separately examined and each classified into "attacked" and "not-attacked."
It is also possible that two samples of 69 attacked and 749 not-attacked persons were
classified into "inoculated" and "not-inoculated." There are thus three ways in
which the table might have been formed, one of the double-dichotomy type and two
of the homogeneity type. Reference to the actual process by which the observations
were collected would be necessary to resolve the choice.
To illustrate the last type in 33.18, we give a fictitious table referring to the
butter-margarine tasting experiment there described:

                                 Identification made by subject
                                   Butter      Margarine
    Object actually   Butter          4            11      |  15
    presented         Margarine      11            14      |  25
                                     15            25      |  40



33.19 We have no right to expect the same method of analysis to remain appro- 
priate to the three different real situations discussed in 33.18 (although we shall see 
in 33.24 below that, so far as tests of independence are concerned, the Case I test 
turns out to be optimum in the other two situations). We therefore now make prob- 
abilistic formulations of the three different situations. We begin with the both- 
margins-fixed situation, since this is the simplest. 


Case I: Both margins fixed 
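On the hypothesis, which we write
$$
H_0 : \frac{p_{11}}{p_{1.}} = \frac{p_{21}}{p_{2.}}, \tag{33.48}
$$
the probability of observing the table
$$
\begin{array}{cc|c}
n_{11} & n_{12} & n_{1.} \\
n_{21} & n_{22} & n_{2.} \\
\hline
n_{.1} & n_{.2} & n
\end{array}
\tag{33.49}
$$
when all marginal frequencies are fixed is
$$
\begin{aligned}
P_{\mathrm{I}} = P\{n_{ij} \mid n, n_{1.}, n_{.1}\} &= P\{n_{ij} \mid n, n_{1.}\} / P\{n_{.1} \mid n\} \\
&= \binom{n_{1.}}{n_{11}} \binom{n_{2.}}{n_{21}} \bigg/ \binom{n}{n_{.1}} \\
&= \frac{n_{1.}!\; n_{.1}!\; n_{2.}!\; n_{.2}!}{n!\; n_{11}!\; n_{12}!\; n_{21}!\; n_{22}!}.
\end{aligned}
\tag{33.50}
$$
(33.50) is symmetrical in the frequencies $n_{ij}$ and in the marginal frequencies, as it
must be from the symmetry of the situation. Since all marginal frequencies are fixed,
only one of the $n_{ij}$ may vary independently, and we may take this to be $n_{11}$ without
loss of generality. Regarding (33.50) as the distribution of $n_{11}$, we see that it is a
hypergeometric distribution (cf. 5.18). In fact, (33.50) is simply the hypergeometric
f.f. (5.48) with the substitutions (the symbols on the left being those of (5.48))
$$
N = n, \quad n = n_{1.}, \quad Np = n_{.1}, \quad Nq = n_{.2}, \quad N - n = n_{2.},
$$
$$
j = n_{11}, \quad n - j = n_{12}, \quad Np - j = n_{21}, \quad Nq - (n - j) = n_{22}.
$$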
The mean and variance of $n_{11}$ are therefore, from (5.53) and (5.55),
$$
E(n_{11}) = n_{1.}\, n_{.1}/n, \qquad
\operatorname{var} n_{11} = \frac{n_{1.}\, n_{.1}\, n_{2.}\, n_{.2}}{n^2 (n-1)}, \tag{33.51}
$$
and $n_{11}$ is asymptotically normal with these moments. Thus
$$
t = \frac{n_{11} - n_{1.}\, n_{.1}/n}{\left\{\dfrac{n_{1.}\, n_{.1}\, n_{2.}\, n_{.2}}{n^2 (n-1)}\right\}^{1/2}} \tag{33.52}
$$
is asymptotically a standardized normal variate. Replacing $(n-1)$ by n, we see that
(33.52) is equivalent to $n^{1/2} V$, where V is defined at (33.12), and hence (cf. 33.17) $t^2$ is
equivalent to the $X^2$ statistic defined at (33.46). This confirms that the general
large-sample test of 33.17 applies in this situation.
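
As a check on (33.50) and (33.51), the probabilities can be enumerated directly and their mean and variance compared with the closed forms. This is a minimal sketch; the function and variable names are illustrative, and the margins are those of the fictitious butter-margarine table of Example 33.4.

```python
from math import comb, isclose

def p_case1(n11, n1_, n2_, n_1):
    """Probability (33.50) of n11 given all margins (hypergeometric form)."""
    n = n1_ + n2_
    return comb(n1_, n11) * comb(n2_, n_1 - n11) / comb(n, n_1)

def support(n1_, n2_, n_1):
    """Values of n11 compatible with the fixed margins."""
    return range(max(0, n_1 - n2_), min(n1_, n_1) + 1)

n1_, n2_, n_1 = 15, 25, 15          # row totals and first column total
n = n1_ + n2_
n_2 = n - n_1

probs = {k: p_case1(k, n1_, n2_, n_1) for k in support(n1_, n2_, n_1)}
mean = sum(k * p for k, p in probs.items())
var = sum((k - mean) ** 2 * p for k, p in probs.items())

# Closed forms from (33.51)
mean_formula = n1_ * n_1 / n
var_formula = n1_ * n_1 * n2_ * n_2 / (n ** 2 * (n - 1))
print(mean, mean_formula)
print(var, var_formula)
```

The enumerated moments should agree with (33.51) up to floating-point rounding.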


33.20 We may use (33.50) to evaluate the exact probability of any given configuration
of frequencies. If we sum these probabilities over the "tail" of the distribution
of $n_{11}$, we may construct a critical region for an exact test, first proposed by R. A.
Fisher. The procedure is illustrated in the following example.


Example 33.5 (Data from Yates, 1934, quoting M. Hellman) 


The following table shows 42 children according to the nature of their teeth and
type of feeding.

                    Normal teeth    Mal-occluded teeth  |  TOTALS
    Breast-fed            4                 16          |    20
    Bottle-fed            1                 21          |    22
    TOTALS                5                 37          |    42


These data evidently do not leave both margins fixed, but for the present we use them 
illustratively and we shall see later (33.24) that this is justified. 

We choose as $n_{11}$ a frequency with the smallest range of variation, i.e. one of the
two frequencies having the smallest marginal frequencies. In this particular case, given
the fixed marginal frequency $n_{.1} = 5$, the range of variation of $n_{11}$ is from 0 to 5.

The probability that $n_{11} = 0$ is, from (33.50),
$$
\frac{5!\; 37!\; 20!\; 22!}{42!\; 0!\; 20!\; 5!\; 17!} = 0.0310.
$$
The probabilities for $n_{11} = 1, 2, \ldots$ are obtained most easily by multiplying by
$$
\frac{5 \times 20}{1 \times 18}, \quad \frac{4 \times 19}{2 \times 19}, \quad \frac{3 \times 18}{3 \times 20}, \ \ldots
$$
and are as follows:

    Number of normal                              Probabilities
    breast-fed children (n11)    Probability    cumulated upwards
              0                    0.0310            1.0001
              1                    0.1720            0.9691
              2                    0.3440            0.7971
              3                    0.3096            0.4531
              4                    0.1253            0.1435
              5                    0.0182            0.0182
                                   ------
                                   1.0001

To test independence against the alternative that normal teeth are positively associated
with breast-feeding, we use a critical region consisting of large values of $n_{11}$ (the
number of normal breast-fed children). We have a choice of two "reasonable" values
for the size of the exact test. For $\alpha = 0.0182$, only $n_{11} = 5$ would lead to rejection of
the hypothesis; for $\alpha = 0.1435$, $n_{11} = 4$ or 5 leads to rejection. Probably, the former
critical region would be used by most statisticians, leading in this particular case
($n_{11} = 4$) to acceptance of the hypothesis of independence.
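
The table of probabilities in this example can be reproduced from (33.50) with a few lines of code. The sketch below is illustrative only (the function name `exact_table` is not from the text); it tabulates the probability of each value of $n_{11}$ together with the probabilities cumulated upwards, for the margins 20, 22 and 5 of the teeth data.

```python
from math import comb

def exact_table(n1_, n2_, n_1):
    """Return ({n11: probability}, {n11: P(n11 or more)}) under (33.50)."""
    n = n1_ + n2_
    lo, hi = max(0, n_1 - n2_), min(n1_, n_1)
    probs = {k: comb(n1_, k) * comb(n2_, n_1 - k) / comb(n, n_1)
             for k in range(lo, hi + 1)}
    upper, running = {}, 0.0
    for k in range(hi, lo - 1, -1):     # accumulate from the upper tail
        running += probs[k]
        upper[k] = running
    return probs, upper

# Margins of Example 33.5: 20 breast-fed, 22 bottle-fed, 5 with normal teeth
probs, upper = exact_table(20, 22, 5)
for k in sorted(probs):
    print(k, f"{probs[k]:.4f}", f"{upper[k]:.4f}")
```

The value `upper[4]` is the one-tail probability attached to the observed table, and `upper[5]` is the size of the smaller critical region discussed above. Working from exact probabilities, the cumulated column sums to 1 rather than to the 1.0001 produced by adding rounded entries.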


33.21 Tables for use in the exact test based on (33.50) have been computed. Finney
(1948) gives the values of $n_{21}$ (his b) required to reject the hypothesis of independence for
values of $n_{1.}$, $n_{2.}$ (or $n_{.1}$, $n_{.2}$) and $n_{11}$ up to 15 and single-tail tests of sizes $\alpha \le 0.05$,
0.025, 0.01, 0.005, together with the exact size in each case. Finney's table is reproduced
in the Biometrika Tables. Latscha (1953) has extended Finney's table to $n_{1.}$, $n_{2.} = 20$.
These tables are extended up to $n_{1.}$, $n_{2.} = 40$ in Finney et al. (1963). Armsen (1955)
gives tables for one- and two-tailed tests of sizes $\alpha \le 0.05$, 0.01 and n ranging to 50.
Bross and Kasten (1957) give charts for one-sided test sizes $\alpha = 0.05$, 0.025, 0.01, 0.005
or two-sided tests of size $2\alpha$, and minimum marginal frequency (say $n_{.1}$) $\le 50$, based on
the approximation of the hypergeometric (33.50) by the binomial
$$
\binom{n_{.1}}{n_{11}} \left(\frac{n_{1.}}{n}\right)^{n_{11}} \left(\frac{n_{2.}}{n}\right)^{n_{21}}.
$$
The critical values in the charts are conservative unless $n_{.1}/n$ is small.


Case II: One margin fixed; homogeneity 

33.22 We write the hypothesis (which is now one of equality of probabilities in
two populations) in the form (33.48) as before, but $n_{1.}$ and $n_{2.}$ are fixed and $n_{11}$, $n_{21}$
are independent random variables, so that $n_{.1}$ (and hence its complement $n_{.2}$) is a
random variable. We test the hypothesis (33.48) by considering the corresponding
difference of proportions
$$
u = \frac{n_{11}}{n_{1.}} - \frac{n_{21}}{n_{2.}}. \tag{33.53}
$$
On the hypothesis, this is asymptotically normal with mean zero and variance
$$
\operatorname{var} u = p(1-p)\left(\frac{1}{n_{1.}} + \frac{1}{n_{2.}}\right),
$$
where p is the hypothetical common value of $p_{11}/p_{1.}$, $p_{21}/p_{2.}$.
We estimate $p(1-p)$ unbiassedly by the pooled estimator
$$
\frac{n}{n-1} \cdot \frac{n_{.1}}{n} \cdot \frac{n_{.2}}{n} = \frac{n_{.1}\, n_{.2}}{n(n-1)},
$$
so the estimated variance of u is
$$
\widehat{\operatorname{var}}\, u = \frac{n_{.1}\, n_{.2}}{n(n-1)} \left(\frac{1}{n_{1.}} + \frac{1}{n_{2.}}\right) = \frac{n_{.1}\, n_{.2}}{(n-1)\, n_{1.}\, n_{2.}}.
$$
Thus we have an asymptotic standardized normal variate
$$
t = \frac{\dfrac{n_{11}}{n_{1.}} - \dfrac{n_{21}}{n_{2.}}}{\left\{\dfrac{n_{.1}\, n_{.2}}{(n-1)\, n_{1.}\, n_{2.}}\right\}^{1/2}}, \tag{33.54}
$$
and this is identical with (33.52). The large-sample tests are therefore identical in
the two cases.
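
The identity of (33.54) with (33.52) is a matter of algebra, but it is easily confirmed numerically. A minimal sketch, with illustrative function names and the frequencies of Example 33.5 used purely as test data:

```python
import math

def t_both_margins_fixed(n11, n12, n21, n22):
    """Standardized n11 of (33.52), using the hypergeometric moments (33.51)."""
    n1_, n2_, n_1, n_2 = n11 + n12, n21 + n22, n11 + n21, n12 + n22
    n = n1_ + n2_
    variance = n1_ * n_1 * n2_ * n_2 / (n ** 2 * (n - 1))
    return (n11 - n1_ * n_1 / n) / math.sqrt(variance)

def t_one_margin_fixed(n11, n12, n21, n22):
    """Standardized difference of proportions of (33.54)."""
    n1_, n2_, n_1, n_2 = n11 + n12, n21 + n22, n11 + n21, n12 + n22
    n = n1_ + n2_
    u = n11 / n1_ - n21 / n2_
    var_u = n_1 * n_2 / ((n - 1) * n1_ * n2_)
    return u / math.sqrt(var_u)

cells = (4, 16, 1, 21)
t1 = t_both_margins_fixed(*cells)
t2 = t_one_margin_fixed(*cells)
print(t1, t2)
```

Both functions reduce to $(n_{11} n_{22} - n_{12} n_{21})\,(n-1)^{1/2} / (n_{1.}\, n_{2.}\, n_{.1}\, n_{.2})^{1/2}$, so they should agree for any table with non-zero margins.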
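But for small samples, the test is different. On the hypothesis, we now have
$$
\begin{aligned}
P_{\mathrm{II}} &= P\{n_{11} \mid p, n_{1.}\}\, P\{n_{21} \mid p, n_{2.}\} \\
&= \binom{n_{1.}}{n_{11}} p^{n_{11}} (1-p)^{n_{12}} \binom{n_{2.}}{n_{21}} p^{n_{21}} (1-p)^{n_{22}} \\
&= \binom{n_{1.}}{n_{11}} \binom{n_{2.}}{n_{21}} p^{n_{.1}} (1-p)^{n_{.2}},
\end{aligned}
\tag{33.55}
$$
that is to say, $P_{\mathrm{I}}$ defined at (33.50) multiplied by a binomial factor
$$
\binom{n}{n_{.1}} p^{n_{.1}} (1-p)^{n_{.2}}.
$$
This must evidently be so, for we have the original probability for fixed $n_{.1}$ now
multiplied by the probability of $n_{.1}$ itself. Unlike (33.50), (33.55) depends on an unknown
parameter, p, and cannot be evaluated.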
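
The factorization of the Case II probability into the Case I probability and a binomial factor for $n_{.1}$ can be verified for any trial value of p. The sketch below uses illustrative names and arbitrary values of p; it is a numerical check, not a method of evaluating (33.55) when p is unknown.

```python
from math import comb, isclose

def p_case1(n11, n12, n21, n22):
    """Probability (33.50): all margins fixed."""
    n1_, n2_, n_1 = n11 + n12, n21 + n22, n11 + n21
    return comb(n1_, n11) * comb(n2_, n21) / comb(n1_ + n2_, n_1)

def p_case2(n11, n12, n21, n22, p):
    """Probability (33.55): product of two independent binomials."""
    n1_, n2_ = n11 + n12, n21 + n22
    first = comb(n1_, n11) * p ** n11 * (1 - p) ** n12
    second = comb(n2_, n21) * p ** n21 * (1 - p) ** n22
    return first * second

def margin_factor(n11, n12, n21, n22, p):
    """Binomial probability of the column total n_.1 in a sample of n."""
    n_1, n_2 = n11 + n21, n12 + n22
    return comb(n_1 + n_2, n_1) * p ** n_1 * (1 - p) ** n_2

cells = (4, 16, 1, 21)
for p in (0.05, 0.12, 0.4):             # arbitrary trial values of p
    lhs = p_case2(*cells, p)
    rhs = p_case1(*cells) * margin_factor(*cells, p)
    print(p, lhs, rhs)
```

The two printed columns should coincide for every p, while their common value changes with p, which is why (33.55) cannot be evaluated without knowledge of p.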


Case III: No margin fixed; double dichotomy 

33.23 We now turn to the case where n is fixed, but none of the marginal totals. The
hypothesis is now genuinely one of bivariate independence. We have already derived
the large-sample test in this context in 33.15-17. The exact probability of $n_{ij}$ is now
$$
P_{\mathrm{III}} = P\{n_{11} \mid n_{1.}, n_{.1}, n\}\, P\{n_{.1} \mid p, n\}\, P\{n_{1.} \mid p', n\}, \tag{33.56}
$$
where $p'$ is the hypothetical common value of $p_{11}/p_{.1}$, $p_{12}/p_{.2}$. The first two factors
on the right of (33.56) are equivalent to (33.55), and the third is
$$
P\{n_{1.} \mid p', n\} = \binom{n}{n_{1.}} (p')^{n_{1.}} (1-p')^{n_{2.}}. \tag{33.57}
$$
Thus (33.56) depends on two unknown parameters $(p, p')$ and cannot be evaluated.


The optimum exact test for 2 x 2 tables 
33.24 We may now demonstrate the remarkable result, first given by Tocher
(1950), that the exact test based on the Case I probabilities (33.50) actually gives UMPU
tests for Cases II and III. The argument can be made very simple. (33.55), the
Case II distribution of $n_{ij}$ given $H_0$, contains a single nuisance parameter p, the
hypothetical common value of $p_{11}/p_{1.}$, $p_{21}/p_{2.}$. It is easily verified that when $H_0$ holds, the
pooled estimator $\hat p = (n_{11}+n_{21})/(n_{1.}+n_{2.})$ is sufficient for p. By 23.10, it is complete and
distributed in the linearized exponential form (23.17). Thus one- and two-sided
UMPU tests of $H_0$ will, by 23.30-1, be based on the conditional distribution of $n_{ij}$
given $n_{.1}$, i.e. (since $n_{1.}$ is already fixed) upon (33.50). Cf. Exercise 23.22.
Similarly, in Case III, we have two nuisance parameters $(p, p')$ in (33.56) for which
$(n_{.1}/n,\; n_{1.}/n)$ are jointly sufficient and complete when $H_0$ holds. Thus, again from 23.10
and 23.30-1, UMPU tests of $H_0$ will be based on the conditional distribution of $n_{ij}$
given $(n_{.1}, n_{1.})$, i.e. upon (33.50).
Thus the conditional Case I distribution (all marginal frequencies fixed) provides
UMPU tests for both the homogeneity and double-dichotomy situations.

It should be remarked that these results only hold strictly if randomization is
permitted in order to obtain tests of any size $\alpha$; the discreteness of the distributions in
a 2 x 2 table limits our choice of test size (cf. Example 33.5 and also 20.22). Unless
the frequencies are very small, however, there is usually at least one "reasonable"
value of $\alpha$ available for the conditional exact test based on (33.50), so that the difficulty
is theoretical rather than practical.


33.25 Although the same test is valid in the three situations, its power function 
will differ, since the alternative to independence must obviously be different in the 
three situations. For Case II (homogeneity), Bennett and Hsu (1960) give charts of 
the power function using the Finney—Latscha tables (cf. 33.21). Patnaik (1948) gave 
approximations adequate for larger samples—see also J. Hannan and Harkness (1963). 
E. S. Pearson and Merrington (1948) carried out sampling experiments on the power 
functions of the exact and asymptotic tests in Case I (both margins fixed). Harkness and 
Katz (1964) compare the power functions in the three cases—see also Harkness (1965). 


33.26 Berger (1961) gives large-sample $\chi^2$ tests for the equality of a measure of
association in two separate 2 x 2 tables. Three measures of association are considered:
(a) the ratio $p_{11}/p_{12}$ in (33.33); (b) Q defined by (33.9); and (c) one equivalent to V
defined at (33.12). Goodman (1963a) generalizes to the case of k separate 2 x 2 tables 
and (1964a) gives other methods based on the cross-ratio ad/bc. See also 33.62 below. 


Continuity correction in the large-sample $X^2$ test
33.27 A continuity correction of the type mentioned in 31.80 was found by Yates
(1934) to improve the fit of the continuous approximation (33.52) to the discrete exact
distribution (33.50) in Case I. The correction requires (using (33.12)) that
$$
X^2 = \frac{n(ad-bc)^2}{(a+b)(a+c)(b+d)(c+d)} \tag{33.58}
$$
should have the term $(ad-bc)$ in its numerator replaced by $|ad-bc| - \tfrac{1}{2}n$, which is
the same as increasing (if $ad > bc$) b and c by $\tfrac{1}{2}$, and reducing a and d by $\tfrac{1}{2}$. Thus the
corrected test statistic is
$$
X_c^2 = \frac{n\{|ad-bc| - \tfrac{1}{2}n\}^2}{(a+b)(a+c)(b+d)(c+d)}. \tag{33.59}
$$
The effect is illustrated in Example 33.6.


Example 33.6 


In Example 33.5, we found the probability that $n_{11} \ge 4$ to be 0.1435. Let us
compare this with the result obtained by using the asymptotic $\chi^2$ distribution with
one degree of freedom. From the table of Example 33.5 we have, for (33.58),
$$
X^2 = \frac{42\{4(21) - 16(1)\}^2}{20 \times 22 \times 5 \times 37} = 2.386.
$$
From Appendix Table 4b, $P\{X^2 > 2.386\} = 0.122$, a more exact value being 0.1224.
This, however, is the probability associated with a two-tailed test, because $X^2$ is the
square of a normal deviate. For comparison with the exact test, we have to halve
this, obtaining 0.0612. The approximation to the exact value of 0.1435 is very poor.
If we apply a continuity correction, the corrected value $X_c^2$ is then, by (33.59),
$$
X_c^2 = \frac{42(68 - 21)^2}{20 \times 22 \times 5 \times 37} = 1.140.
$$
The corresponding probability from the $\chi^2$ table is 0.2854, one half of which is 0.1427,
this time in excellent agreement with the exact value of 0.1435.
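
The arithmetic of this example can be reproduced without tables. The sketch below is illustrative: it relies on the fact that a $\chi^2$ variate with one degree of freedom is the square of a standardized normal deviate, so that its upper-tail probability beyond x is erfc(√(x/2)); the function names are not from the text.

```python
import math

def chi2_upper_tail_1df(x):
    """P{chi-square with 1 d.f. > x}, via the normal distribution."""
    return math.erfc(math.sqrt(x / 2.0))

def x2_uncorrected(a, b, c, d):
    """X^2 of (33.58)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))

def x2_corrected(a, b, c, d):
    """Continuity-corrected X^2 of (33.59)."""
    n = a + b + c + d
    return (n * (abs(a * d - b * c) - n / 2.0) ** 2
            / ((a + b) * (a + c) * (b + d) * (c + d)))

a, b, c, d = 4, 16, 1, 21                # table of Example 33.5
x2 = x2_uncorrected(a, b, c, d)
x2c = x2_corrected(a, b, c, d)
one_tail = chi2_upper_tail_1df(x2) / 2.0
one_tail_corrected = chi2_upper_tail_1df(x2c) / 2.0
print(round(x2, 3), round(one_tail, 4))
print(round(x2c, 3), round(one_tail_corrected, 4))

# Since ad > bc here, the correction is the same as moving half a unit
# from a and d to b and c, which leaves every margin unchanged.
shifted = x2_uncorrected(a - 0.5, b + 0.5, c + 0.5, d - 0.5)
```

Computed directly, the corrected one-tail probability may differ from the tabular 0.1427 in the fourth decimal place, since the text's value rests on interpolation in a printed table.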


Cochran (1954) recommends that the continuity-corrected $X^2$ be used as an adequate
approximation to the exact test for $n > 40$; and if no hypothetical frequency is less than 5,
for $n > 20$ also.

In Cases II and III, a continuity correction does not improve the fit of (33.52) to
(33.55) or (33.56); cf. Plackett (1964).

Lancaster (1949a) examined the effect of continuity corrections in cases where a
number of $X^2$ values from different tables are added together. In such circumstances,
each $X^2$ should not be corrected for continuity, or a serious bias may result. Lancaster
shows that where the original tables cannot be pooled, the best procedure is to add the
uncorrected $X^2$ values. Similar results for one-tailed tests are given by Yates (1955).


The general r x c table: measurement of association 


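33.28 We now consider the more general situation in which two variables are
classified into two or more categories. We extend our notation to write the r x c table
in the form:
$$
\begin{array}{cccc|c}
n_{11} & n_{12} & \cdots & n_{1c} & n_{1.} \\
n_{21} & n_{22} & \cdots & n_{2c} & n_{2.} \\
\vdots & \vdots & & \vdots & \vdots \\
n_{r1} & n_{r2} & \cdots & n_{rc} & n_{r.} \\
\hline
n_{.1} & n_{.2} & \cdots & n_{.c} & n
\end{array}
\tag{33.60}
$$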

In the older literature, (33.60) is called a contingency table. The discussion of 33.1-2 
applies to this general two-variable categorization. 

The problem of measuring association in such a table presents severe difficulties 
which are, in a sense, inherent. In Chapter 26 we found in the case of measured 
variates that it may be impossible to express a complicated pattern of interdependence 
in terms of a single coefficient, and this holds similarly in the present situation. The 
most successful attempts to do so have been based on more or less latent assumptions 
about the nature of underlying variate-distributions. 


33.29 In (33.60), if the two variables were independent, the frequency in the ith
row and jth column would be $n_{i.}\, n_{.j}/n$. The deviation from independence in that
particular cell of the table is therefore measured by
$$
D_{ij} = n_{ij} - n_{i.}\, n_{.j}/n, \tag{33.61}
$$
the generalization of (33.8). We may define a coefficient of association in terms of
the so-called square contingency $\sum_{i,j} D_{ij}^2/(n_{i.}\, n_{.j})$ and shall write
$$
X^2 = \sum_{i,j} \frac{D_{ij}^2}{n_{i.}\, n_{.j}/n} = n\left\{\sum_{i,j} \frac{n_{ij}^2}{n_{i.}\, n_{.j}} - 1\right\}, \tag{33.62}
$$
the generalization of (33.46). On the hypothesis of independence, $X^2$ is asymptotically
distributed in the $\chi^2$ form, as is easy to see from the goodness-of-fit standpoint
mentioned in 33.17 above; the degrees of freedom are given by
$$
(rc-1) - (r-1) - (c-1) = (r-1)(c-1),
$$
the number of classes minus 1 minus the number of parameters fitted.


33.30 $X^2$ itself is not a convenient measure of association, since its upper limit
is infinite as n increases. Following Karl Pearson (1904), we put
$$
P = \left\{\frac{X^2}{n + X^2}\right\}^{1/2} \tag{33.63}
$$
and call P Pearson's coefficient of contingency. It was proposed because it may be
shown that if a bivariate normal distribution with correlation parameter $\rho$ is classified
into a contingency table, then $P^2 \to \rho^2$ as the number of categories in the table increases.
For finite r and c, however, the coefficient P has limitations. It vanishes, as it should,
when there is complete independence; and conversely, if $P = 0$, we have $X^2 = 0$
so that every deviation $D_{ij}$ is zero. Clearly $0 \le P < 1$. But in general, P cannot
attain the same upper limit, and therefore fails to satisfy a desideratum mentioned in
33.5. Consider, for example, a "square" table with $r = c$, in which only the leading
diagonal frequencies $n_{ii}$ are non-zero. Then $n_{i.} = n_{.i} = n_{ii}$, all i, and by (33.62)
$$
X^2 = n(r-1),
$$
so that, from (33.63),
$$
P = \left\{\frac{r-1}{r}\right\}^{1/2}.
$$
Thus even in such a case of complete association (cf. 33.7) the value of P, its maximum,
depends on the number of rows and columns in the table.
To remedy this, Tschuprow proposed the alternative function of $X^2$
$$
T = \left\{\frac{X^2}{n[(r-1)(c-1)]^{1/2}}\right\}^{1/2}, \tag{33.64}
$$
which attains +1 when $r = c$ in a case of complete association as above, but cannot
do so if $r \ne c$. In fact, it is easy to see, just as above, that the maximum attainable
value for $X^2$ is $n \times \min(r-1, c-1)$ (attained when all the frequencies lie in a longest
diagonal of the table) and thus the attainable upper bound for P is
$$
\left\{\frac{\min(r-1, c-1)}{1 + \min(r-1, c-1)}\right\}^{1/2},
$$
while that for T is
$$
\left\{\frac{\min(r-1, c-1)}{[(r-1)(c-1)]^{1/2}}\right\}^{1/2} = \left\{\frac{\min(r-1, c-1)}{\max(r-1, c-1)}\right\}^{1/4}.
$$
Following Cramér (1946), we may define a further modification, which can always
attain +1, by
$$
C = \left\{\frac{X^2}{n \min(r-1, c-1)}\right\}^{1/2} = T \left\{\frac{[(r-1)(c-1)]^{1/2}}{\min(r-1, c-1)}\right\}^{1/2}. \tag{33.65}
$$
Evidently $C = T$ when the table is square but $C > T$ otherwise, although the difference
will not be very large unless r and c are very different. We also see that
$$
\frac{P^2}{T^2} = \frac{[(r-1)(c-1)]^{1/2}}{1 + (X^2/n)},
$$
so that as n increases we expect to have $P > T$ if independence holds, when $X^2$ has
expectation $(r-1)(c-1)$. The difference $P - T$ is often substantial. Cf. Exercise 33.4.


Example 33.7 

The table (from W. H. Gilby, Biometrika, 8, 94) shows the distribution of 1725
school children who were classified (1) according to their standard of clothing, and (2)
according to their intelligence, the standards in the latter case being A = mentally
deficient, B = slow and dull, C = dull, D = slow but intelligent, E = fairly intelligent,
F = distinctly capable, G = very able.

Table 33.2

                                     Intelligence class
    Standard of clothing    A and B     C      D      E      F      G  |  TOTALS
    Very well clad              33     48    113    209    194     39  |    636
    Well clad                   41    100    202    255    138     15  |    751
    Poor but passable           39     58     70     61     33      4  |    265
    Very badly clad             17     13     22     10     10      1  |     73
    TOTALS                     130    219    407    535    375     59  |   1725

We investigate the association between standard of clothing and intelligence. We
first work out the "independence" frequencies $n_{i.}\, n_{.j}/n$. For example, $n_{1.}\, n_{.1}/n$ is
$636 \times 130/1725 = 47.930$. The term $nD_{11}^2/(n_{1.}\, n_{.1})$ in (33.62) is then
$$
(33 - 47.930)^2/47.930 = 4.651.
$$
The sum of the 24 such terms in the table will be found to be $X^2 = 174.92$.
It is quicker to calculate $X^2$ from the extreme right-hand side of (33.62), i.e. to
calculate
$$
X^2 = n\left\{\sum_{i,j} \frac{n_{ij}^2}{n_{i.}\, n_{.j}} - 1\right\}, \tag{33.66}
$$
and with a calculating machine this is expeditiously evaluated by first dividing the
square of every frequency in the ith row by its row total, and then dividing all the
resulting quotients in the jth column by that column's original total frequency. We should,
in this case, first have
$$
\begin{array}{cccccc}
\dfrac{33^2}{636}, & \dfrac{48^2}{636}, & \dfrac{113^2}{636}, & \dfrac{209^2}{636}, & \dfrac{194^2}{636}, & \dfrac{39^2}{636}, \\[2ex]
\dfrac{41^2}{751}, & \dfrac{100^2}{751}, & \dfrac{202^2}{751}, & \dfrac{255^2}{751}, & \dfrac{138^2}{751}, & \dfrac{15^2}{751},
\end{array}
$$
and two further rows, and then divide the columns of this array by 130, 219, etc. We
then have only to subtract 1 from the total of the 24 entries in the final array, and
multiply by 1725 (= n) to have (33.66). The reader should check the computation
by both methods.
With $X^2 = 174.92$, we now have, from (33.63-5), the coefficients
$$
P = \left(\frac{174.92}{1725 + 174.92}\right)^{1/2} = 0.303,
$$
$$
T = \left(\frac{174.92}{1725 \times \sqrt{15}}\right)^{1/2} = 0.162,
$$
$$
C = \left(\frac{174.92}{1725 \times 3}\right)^{1/2} = 0.184.
$$
The relationship between the three coefficients is as we should expect from the remarks
at the end of 33.30, with C little larger than T, but P nearly twice as large, even though
its attainable upper bound here is $(3/4)^{1/2} = 0.866$ against $(3/5)^{1/4} = 0.880$ for T (and, of
course, 1 for C). Thus it happens that the attainable upper bounds for P and T are
almost the same in this particular case, but P gives the impression of a considerably
stronger association between the variables. Exercise 33.3 finds similar results for
another set of data.
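
The two methods of computing $X^2$, and the three coefficients with their attainable bounds, can be sketched in code as follows. This is an illustration under stated assumptions: the cell frequencies entered below are a reconstruction of Table 33.2 chosen to agree with its printed marginal totals, and the function names are not from the text. Because the cell entries are reconstructed, the computed $X^2$ need not reproduce the printed 174.92 to the last digit; the coefficients are therefore evaluated from the printed value.

```python
import math

def margins(table):
    """Row totals, column totals and grand total of an r x c table."""
    rows = [sum(r) for r in table]
    cols = [sum(col) for col in zip(*table)]
    return rows, cols, sum(rows)

def x2_from_deviations(table):
    """First form of (33.62): sum of (n_ij - e_ij)^2 / e_ij."""
    rows, cols, n = margins(table)
    total = 0.0
    for i, row in enumerate(table):
        for j, nij in enumerate(row):
            e = rows[i] * cols[j] / n
            total += (nij - e) ** 2 / e
    return total

def x2_shortcut(table):
    """Second form (33.66): n * (sum of n_ij^2 / (n_i. n_.j) - 1)."""
    rows, cols, n = margins(table)
    s = sum(nij ** 2 / (rows[i] * cols[j])
            for i, row in enumerate(table) for j, nij in enumerate(row))
    return n * (s - 1.0)

def coefficients(x2, n, r, c):
    """Pearson's P (33.63), Tschuprow's T (33.64), Cramer's C (33.65)."""
    p = math.sqrt(x2 / (n + x2))
    t = math.sqrt(x2 / (n * math.sqrt((r - 1) * (c - 1))))
    cr = math.sqrt(x2 / (n * min(r - 1, c - 1)))
    return p, t, cr

def upper_bounds(r, c):
    """Attainable upper bounds of P and T (that of C is 1)."""
    lo, hi = min(r - 1, c - 1), max(r - 1, c - 1)
    return math.sqrt(lo / (1 + lo)), (lo / hi) ** 0.25

# Assumed reconstruction of Table 33.2 (rows: clothing; columns: A and B ... G)
table = [
    [33, 48, 113, 209, 194, 39],
    [41, 100, 202, 255, 138, 15],
    [39, 58, 70, 61, 33, 4],
    [17, 13, 22, 10, 10, 1],
]
rows, cols, n = margins(table)
x2_a, x2_b = x2_from_deviations(table), x2_shortcut(table)
P, T, C = coefficients(174.92, n, len(table), len(table[0]))
bound_P, bound_T = upper_bounds(len(table), len(table[0]))
print(round(x2_a, 2), round(x2_b, 2))
print(round(P, 3), round(T, 3), round(C, 3), round(bound_P, 3), round(bound_T, 3))
```

The two forms of (33.62) are algebraically identical, so `x2_a` and `x2_b` should agree to rounding error whatever table is supplied.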

All three coefficients are monotone functions of $X^2$, and we may therefore test
independence directly by $X^2$, which we have seen to be 174.92. As there are
$(r-1)(c-1) = 15$ degrees of freedom, this far exceeds any critical value likely to be
used in practice: a test of size $\alpha = 0.001$ would use a critical value of 37.697.


Models for the r x c table 

33.31 In discussing measures of association in the r x c table, distinctions as to
the underlying model, similar to those made for the 2 x 2 table in 33.18-23, are
necessary. S. N. Roy and Mitra (1956) have explicitly extended that discussion of the three
types of table (both sets of marginal frequencies fixed, one set fixed, neither set fixed)
to the general r x c table; no new point arises. Roy and Mitra go on to show (as we
did for the 2 x 2 table) that, on the hypothesis of independence, $X^2$ defined at (33.62)
is asymptotically distributed in the $\chi^2$ form with $(r-1)(c-1)$ degrees of freedom
under all three models. It is intuitively obvious that the differences between the
models, given the hypothesis of independence, vanish asymptotically, since any marginal
frequencies which are random variables will converge in probability to their expectations.

Exact treatment of the r x c table on the lines of 33.20 is necessarily a tedious piece
of enumeration. Freeman and Halton (1951) give details of the method.


Standard errors of the coefficients and of $X^2$

33.32 The coefficients of association (33.63-5) are all monotone increasing functions
of $X^2$, and their standard errors can therefore be deduced from the standard error
of $X^2$ by the use of (10.14). For example, the standard error of $(X^2)^{1/2}$, to which both
(33.64) and (33.65) reduce apart from constants, is, to order $n^{-1}$,
$$
\operatorname{var}\{(X^2)^{1/2}\} = \left\{\frac{d}{d(\mathbf{X}^2)} (\mathbf{X}^2)^{1/2}\right\}^2 \operatorname{var} X^2 = \frac{1}{4\mathbf{X}^2} \operatorname{var} X^2, \tag{33.67}
$$
where $\mathbf{X}^2$ is the population value of $X^2$, i.e. (33.62) calculated from the population
frequencies. Clearly, the first approximation (33.67) is only valid if $\mathbf{X}^2 \ne 0$, i.e. it
does not hold in the case where the two categorized variables are independent in the
population. This is because the sample $X^2$ converges in probability to the population
$\mathbf{X}^2$, and since $X^2 \ge 0$ its distribution has a singularity when $\mathbf{X}^2 = 0$ and its
variance is of order $n^{-2}$; cf. the analogous situation for the squared multiple correlation
coefficient in 27.33. Fortunately, we need not pursue the case of independence, since
we may then test with $X^2$ itself. If we wish to estimate non-zero population coefficients
$\mathbf{T}$ and $\mathbf{C}$ defined in terms of $\mathbf{X}^2$ by the population analogues of (33.64-5), and set
standard errors to the estimates, we have
$$
\begin{aligned}
\operatorname{var} T &= \frac{1}{4n^2 [(r-1)(c-1)]\, \mathbf{T}^2} \operatorname{var} X^2, \qquad \mathbf{T} \ne 0, \\
\operatorname{var} C &= \frac{1}{4n^2 [\min(r-1, c-1)]^2\, \mathbf{C}^2} \operatorname{var} X^2, \qquad \mathbf{C} \ne 0.
\end{aligned}
\tag{33.68}
$$
For Pearson's coefficient (33.63), the same difficulty arises in the case of independence,
since
$$
\left\{\frac{d\mathbf{P}}{d(\mathbf{X}^2)}\right\}^2 = \frac{n^2}{4\mathbf{X}^2 (n + \mathbf{X}^2)^3},
$$
so that we may only write
$$
\operatorname{var} P = \frac{n^2}{4\mathbf{X}^2 (n + \mathbf{X}^2)^3} \operatorname{var} X^2, \qquad \mathbf{X}^2 \ne 0. \tag{33.69}
$$
This, unlike the expressions (33.68), cannot be written in terms of a parent coefficient $\mathbf{P}$
alone.


33.33 From 33.32, we see that we need only calculate the variance of $X^2$ to obtain
the standard errors we require. The variance of $X^2$ in the case of non-independence
is complicated, but was worked out by K. Pearson (1915) for a table with fixed marginal
frequencies, and more accurately by Young and Pearson (1915, 1919) who gave the
variance to order $1/n^2$. Kondo (1929) gives the mean and variance to order $1/n^2$ when
the marginal frequencies are random variables, and shows explicitly that the variance is
of order $1/n^2$ in the case of independence. (The exact variance of $X^2$ in the
independence case is given by Haldane (1940); a specialization of his result is given as
Exercise 33.9.) The formulae are lengthy, and we shall quote only the first approximation
given by K. Pearson (1915):

a—n: n/n)? X? 2\ 2 
estimated var X? = 4n {Ez eye Rash + at (=) \. (33.70) 
If (33.70) is substituted into (33.68-9) and the sample values $X^2$, T, C written for the
population values $\mathbf{X}^2$, $\mathbf{T}$, $\mathbf{C}$, we have the required standard errors.


33.34 The summation on the right of (33.70) differs from the definition of $X^2$
at (33.62) only in that the denominator term is squared. If the marginal frequencies
all increase with n, this implies that the dominating term on the right of (33.70) will
be the middle one in the braces, so that asymptotically we may estimate the variance
of $X^2$ by $4nX^2/n = 4X^2$. This may also be seen directly. Under the conditions of
30.27, we have that $X^2$ will be distributed asymptotically in the non-central $\chi^2$ form,
in this context with $\nu = (r-1)(c-1)$ degrees of freedom and non-central parameter
(30.62), where the $p_{0i}$ here are the "independence" frequencies. By Exercise 24.1,
the variance of such a distribution is
$$
2(\nu + 2\lambda) = 2\left\{(r-1)(c-1) + 2n \sum_{i,j} \frac{(p_{ij} - p_{i.}\, p_{.j})^2}{p_{i.}\, p_{.j}}\right\}.
$$
For large n, the first term on the right is negligible,(*) and the second is estimated by
$$
4n \sum_{i,j} \frac{\left(\dfrac{n_{ij}}{n} - \dfrac{n_{i.}\, n_{.j}}{n^2}\right)^2}{n_{i.}\, n_{.j}/n^2} = 4X^2.
$$


Thus the leading term in the standard error of $X^2$ derives from its asymptotic non-central
$\chi^2$ distribution. It is worth noting that the non-central parameter $\lambda$ in the distribution
of $X^2$ is estimated by $X^2$ itself, so that the use of $X^2$ and its standard error is equivalent
to setting approximate limits for $\lambda$. Bulmer (1958) discusses confidence limits for $\lambda^{1/2}$,
which is a natural "distance" parameter.

Substitution of $\operatorname{var} X^2 = 4X^2$ into (33.68) gives, after simplifying, the approximations
$$
\operatorname{var} T = 1/\{n[(r-1)(c-1)]^{1/2}\}, \qquad \operatorname{var} C = 1/\{n \min(r-1, c-1)\}.
$$


Other measures of association 


33.35 It cannot be pretended that any of the coefficients (33.63-5) based on the
X² statistic has been shown to be a satisfactory measure of association, principally
because their values have no simple probabilistic interpretation (cf. Blalock (1958)).
A number of more readily interpretable measures were proposed by Goodman and
Kruskal (1954, 1959). We have already seen in 33.11-14 that two of their principal
suggestions reduce in the case of a 2 × 2 table to two of the conventional measures of
association. For the general r × c table, this is not so, and the Goodman-Kruskal
coefficients are not functions of the X² statistic, unlike the measures so far discussed.
For example, generalizing the approach of 33.13-14, they define a population coefficient
which is the relative decrease in the probability of incorrectly predicting one variable
when the second variable is known, and a symmetrized coefficient which takes both
predictive directions into account. For ordered classifications, the approach of 33.12
generalizes similarly; we refer to it in our discussion of ordered tables in 33.40 below.
Goodman and Kruskal (1963) develop formulae for the standard errors of some of their
coefficients under various sampling models. See also Goodman (1964b).


Lancaster and Hamdan (1964) generalize the tetrachoric method of 26.27-9 to a
polychoric method for estimation of the correlation coefficient in the general r × c table.
The method, which uses sets of (r−1) and (c−1) orthonormal functions as in 33.44-5,
requires electronic computing facilities, but gives much better results than P at (33.63),
which essentially (cf. 33.34) equates X² to its asymptotic value in the bivariate normal
case, nρ²(1−ρ²)^{-1}, and often gives unduly low estimates.


A LR statistic may be defined which is asymptotically equivalent to X², as in 30.5
and Exercise 30.11. Gabriel (1966) shows how this may be used to test subsets of the
categories, and of the populations, compared (cf. the discussion of simultaneous test
procedures in 35.54, 35.63 (Vol. 3)).


(*) Except in the case of independence, when the second term is zero; this is the singularity
referred to in 33.32 above.


Ordered tables: rank measures of association


33.36 If there is a natural ordering (cf. 33.1) of row- and of column-categories
in an r × c table, we are presented with a new situation, which was not distinguished in
the 2 × 2 table case because with only two categories the two possible orders of the
categories can only change the sign of any measure of association. With three or more
categories, the knowledge that there is an order between the categories conveys new
statistical information which we may use in measuring association. Generally, we are
unable to assume any underlying metric for the categories; we know that the categories
proceed, say, from "high" to "low" values of an underlying variable, but
we can attach no numerical values to them. In such a case, we may make what is
perhaps a slightly unexpected application of the rank-order statistic t discussed in
31.24. For we may regard an r × c table with a grand total of n observations as a way
of displaying the rankings of n objects according to two variables, for one of which
only r separate ranks are distinguished and for the other of which only c separate ranks
are distinguished. From this point of view, the marginal frequencies in the table are
the numbers of observations "tied" (cf. 31.81) at the different rank values distinguished.
The case where there are no "ties" corresponds to a table all of whose marginal
frequencies are unity.


33.37 The measurement of association is now seen to be simply the problem of
measuring the correlation between the two rankings. Either of the coefficients t and
r_s defined at (31.23), (31.40) may be used, but we shall discuss only the former.

Some slight problems are produced by the fact that we are interested here in rankings
with many ties. In the first place, we can no longer define the rank correlation
coefficient in terms of a simple 0-1 scoring system as in the definition of h_{ij} at (31.36),
for we now have three possibilities instead of two. We therefore define


\[
a_{ij} = \begin{cases} +1 & \text{if } x_i < x_j, \\ \phantom{+}0 & \text{if } x_i = x_j, \\ -1 & \text{if } x_i > x_j, \end{cases}
\qquad
b_{ij} = \begin{cases} +1 & \text{if } y_i < y_j, \\ \phantom{+}0 & \text{if } y_i = y_j, \\ -1 & \text{if } y_i > y_j. \end{cases}
\]


Our measure of rank correlation is now to be based on the sum


\[ S = \sum_{i,j} a_{ij} b_{ij}. \tag{33.71} \]


If we wish to standardize S to lie in the range (−1, +1) and attain its endpoints in
the extreme cases of complete dissociation and complete association, thus satisfying
the desideratum of 33.5, we have a choice of several possibilities:

(1) If there were no ties, no a_{ij} or b_{ij} could be zero, and (33.71) would vary between
±n(n−1) inclusive. The measure of association would then be S/{n(n−1)}. The
reader may satisfy himself that this is identical with t of (31.23), from the definitions
of h_{ij} and a_{ij}, b_{ij}. If some scores a_{ij}, b_{ij} are zero, this measure, which we shall now
write

\[ t_a = \frac{S}{n(n-1)}, \tag{33.72} \]

can no longer attain ±1; its actual limits of variation depend on the number of zero
scores.

(2) If we rewrite the denominator of (33.72) for the case of no ties as

\[ n(n-1) = \Bigl\{ \sum_{i,j} a_{ij}^2 \sum_{i,j} b_{ij}^2 \Bigr\}^{1/2}, \]

which makes clear that t is a correlation coefficient between the two sets of scores (cf.
Daniels (1944)), we may define

\[ t_b = \frac{\sum_{i,j} a_{ij} b_{ij}}{\bigl\{ \sum_{i,j} a_{ij}^2 \sum_{i,j} b_{ij}^2 \bigr\}^{1/2}}. \tag{33.73} \]

t_a and t_b are identical when there are no zero scores, but otherwise the denominator
of (33.73) is smaller than that of (33.72) and thus t_b > t_a. Even so, t_b cannot attain
±1 generally, for the Cauchy inequality

\[ \Bigl( \sum_{i,j} a_{ij} b_{ij} \Bigr)^2 \le \sum_{i,j} a_{ij}^2 \sum_{i,j} b_{ij}^2 \]

only becomes an equality when the sets of scores a_{ij}, b_{ij} are proportional, which here
means that all the observations must be concentrated in a positive or negative leading
diagonal of the table (i.e. north-west to south-east or north-east to south-west). If
no marginal frequency is to be zero, this means that only for a square table (i.e. an
r × r table) can t_b attain ±1.

(3) For a non-square r × c table (r ≠ c), |Σ a_{ij} b_{ij}| attains its maximum when all
the observations lie in cells of a longest diagonal of the table (i.e. a diagonal containing
m = min(r, c) cells) and are as equally as possible divided between these cells. If
n is a multiple of m (as we may suppose here, since n is usually large and m a small
integer), the reader may satisfy himself that this maximum is n²(m−1)/m, and thus
a third measure is

\[ t_c = \frac{m \sum_{i,j} a_{ij} b_{ij}}{n^2 (m-1)}. \tag{33.74} \]

t_c can attain ±1 for any r × c table, apart from the slight effect produced by n not being
a multiple of m. For large n, (33.72) and (33.74) show that t_c is nearly m t_a/(m−1).
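
The three standardizations may be checked directly from the definitions of a_{ij} and b_{ij} on a very small table. The following is a minimal sketch only: the function name `rank_measures` and the 2 × 2 test table are illustrative assumptions, not part of the text, and the brute-force loop over all ordered pairs is practicable only when n is small.

```python
from itertools import product
from math import sqrt


def sign(u):
    """Return +1, 0 or -1 according to the sign of u."""
    return (u > 0) - (u < 0)


def rank_measures(table):
    """Compute S of (33.71) and t_a, t_b, t_c by scoring every ordered pair."""
    # Expand the table into one (row rank, column rank) pair per observation.
    obs = [(i, j)
           for i, row in enumerate(table)
           for j, freq in enumerate(row)
           for _ in range(freq)]
    n = len(obs)
    m = min(len(table), len(table[0]))
    s = sum_a2 = sum_b2 = 0
    for (xi, yi), (xj, yj) in product(obs, repeat=2):
        a = sign(xj - xi)  # +1 if x_i < x_j, 0 if tied, -1 if x_i > x_j
        b = sign(yj - yi)  # the same scoring for the second variable
        s += a * b
        sum_a2 += a * a
        sum_b2 += b * b
    t_a = s / (n * (n - 1))           # (33.72)
    t_b = s / sqrt(sum_a2 * sum_b2)   # (33.73)
    t_c = m * s / (n * n * (m - 1))   # (33.74)
    return s, t_a, t_b, t_c


# Four observations, all on the leading diagonal of a 2 x 2 table.
s, t_a, t_b, t_c = rank_measures([[2, 0], [0, 2]])
print(s, t_a, t_b, t_c)
```

In this illustration every observation lies on the leading diagonal, so t_b and t_c attain +1 while t_a, which takes no account of the ties, falls short of it.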


33.38 The coefficients t_b and t_c do not differ much in value if each margin contains
approximately equal frequencies. For

\[ \sum_{i,j} a_{ij}^2 = n(n-1) - \sum_{p=1}^{c} n_{.p}(n_{.p} - 1) = n^2 - \sum_{p=1}^{c} n_{.p}^2, \]

and similarly

\[ \sum_{i,j} b_{ij}^2 = n^2 - \sum_{p=1}^{r} n_{p.}^2. \tag{33.75} \]

Thus the denominator of t_b is

\[ \Bigl( \sum_{i,j} a_{ij}^2 \sum_{i,j} b_{ij}^2 \Bigr)^{1/2} = \Bigl\{ \Bigl( n^2 - \sum_{p=1}^{c} n_{.p}^2 \Bigr) \Bigl( n^2 - \sum_{p=1}^{r} n_{p.}^2 \Bigr) \Bigr\}^{1/2}, \tag{33.76} \]

while that of t_c is

\[ n^2 (m-1)/m = n^2 (1 - 1/m). \tag{33.77} \]

If all the marginal column frequencies n_{.p} are equal, and all the marginal row
frequencies n_{p.} are equal, (33.76) reduces to

\[ n^2 \{ (1 - 1/c)(1 - 1/r) \}^{1/2} \tag{33.78} \]

approximately. (33.78) is the same as (33.77) if the table is square (r = c = m);
otherwise (33.78) is the larger and thus t_b the smaller. This tends to be more than
offset by the fact that if the marginal frequencies are not precisely equal, the sums of
squares will be increased, and (33.76), the denominator of t_b, therefore decreases.


The following example (cf. also Kendall (1962)) illustrates the computations of the
coefficients.


Example 33.8 


In the table below, we are interested in the association between distance vision in 
right and left eye. 


Table 33.3. 3242 men aged 30-39 employed in U.K. Royal Ordnance factories
1943-6: unaided distance vision


                                 Left eye
                  Highest   Second    Third    Lowest
Right eye          grade     grade    grade     grade  | TOTALS

Highest grade       821       112       85       35    |  1053
Second grade        116       494      145       27    |   782
Third grade          72       151      583       87    |   893
Lowest grade         43        34      106      331    |   514

TOTALS             1052       791      919      480    |  3242


The numerator of all forms of t is calculated by taking each cell in the table in turn
and multiplying its frequency positively by all frequencies to its south-east and
negatively by all frequencies to its south-west. Cells in the same row and column are
always ignored. (There is no need to apply the process to the last row of the table,
which has nothing below it.) Σ a_{ij} b_{ij} is twice the sum of all these terms, because
we may have i < j or i > j. For this particular table, we have

  821 (494 + 145 + 27 + 151 + 583 + 87 + 34 + 106 + 331)
+ 112 (145 + 27 + 583 + 87 + 106 + 331 − 116 − 72 − 43),


and so on. As we proceed down the table, fewer terms enter the brackets. The
reader should verify that we find, on summing inside the brackets,


  821 (1958) + 112 (1048) + 85 (−465) + 35 (−1744)
+ 116 (1292) + 494 (992) + 145 (118) + 27 (−989)
+ 72 (471) + 151 (394) + 583 (254) + 87 (−183)
= 2,480,223.




Thus the numerator is

\[ \sum_{i,j} a_{ij} b_{ij} = 4,960,446. \]

From (33.75), the denominator of t_b is

[{3242² − (1052² + 791² + 919² + 480²)} {3242² − (1053² + 782² + 893² + 514²)}]^{1/2}
= [7,728,586 × 7,703,218]^{1/2}.


Thus

\[ t_b = \frac{4,960,446}{[7,728,586 \times 7,703,218]^{1/2}} = 0.643. \]

From (33.74), on the other hand,

\[ t_c = \frac{4 \times 4,960,446}{3 \times 3242^2} = 0.629. \]


We therefore find t_b a trifle larger in this case, where both sets of marginal frequencies
vary by about a factor of 2 from largest to smallest. A similar result is found in
Exercise 33.10, where the range of variation of marginal frequencies is about threefold.
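
The south-east and south-west rule of this example lends itself to mechanical computation. The sketch below is illustrative only (the function name `tau_numerator` and the variable names are assumptions of ours); it recomputes the numerator and the two coefficients for Table 33.3 from the cell frequencies and the marginal totals.

```python
from math import sqrt

# Table 33.3: rows are right-eye grades, columns are left-eye grades.
table = [
    [821, 112, 85, 35],
    [116, 494, 145, 27],
    [72, 151, 583, 87],
    [43, 34, 106, 331],
]


def tau_numerator(t):
    """Sum of a_ij * b_ij over ordered pairs, by the south-east/south-west rule."""
    r, c = len(t), len(t[0])
    half = 0
    for i in range(r):
        for j in range(c):
            south_east = sum(t[k][l] for k in range(i + 1, r) for l in range(j + 1, c))
            south_west = sum(t[k][l] for k in range(i + 1, r) for l in range(j))
            half += t[i][j] * (south_east - south_west)
    # Each unordered pair is counted twice in the sum over i and j.
    return 2 * half


n = sum(map(sum, table))
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
m = min(len(table), len(table[0]))

numerator = tau_numerator(table)
denom_b = sqrt((n * n - sum(v * v for v in col_totals))
               * (n * n - sum(v * v for v in row_totals)))
t_b = numerator / denom_b                  # (33.73)
t_c = m * numerator / (n * n * (m - 1))    # (33.74)
print(numerator, round(t_b, 3), round(t_c, 3))
```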


33.39 Apart from the question of attaining the limits ±1 discussed in 33.37
above, the main difference between the forms t_b and t_c is that an upper bound (see
(33.81) below) can be set for the standard error of t_c in sampling n observations, the
marginal frequencies not being fixed; in such a situation, t_b is a ratio of random
variables and its standard error is not known. If the marginal frequencies are fixed, t_b
is no longer a ratio of random variables, but its distribution has only been investigated on
the hypothesis of independence of the two variables categorized in the table. Of course,
if we wish only to test independence, we need concern ourselves only with the common
numerator Σ a_{ij} b_{ij}; the details of the test are given by Kendall (1962). Stuart (1953)
showed how the upper bound for the variance of t_c may be used to test the difference
between two values found for different tables. This is fairly obvious, and we omit it here.


33.40 Goodman and Kruskal (1954) proposed a measure of association for ordered
tables which is closely related to the t coefficients we have discussed. It has the same
numerator, but yet another different denominator, and is

\[ G = \frac{\sum_{i,j} a_{ij} b_{ij}}{n^2 - \sum_{p=1}^{r} n_{p.}^2 - \sum_{q=1}^{c} n_{.q}^2 + \sum_{p=1}^{r} \sum_{q=1}^{c} n_{pq}^2}. \tag{33.79} \]

If we compare the denominator of G with that of t_b at (33.75), which is identically
equal to

\[ \Bigl\{ \Bigl[ n^2 - \tfrac{1}{2} \Bigl( \sum_p n_{p.}^2 + \sum_p n_{.p}^2 \Bigr) \Bigr]^2 - \tfrac{1}{4} \Bigl( \sum_p n_{p.}^2 - \sum_p n_{.p}^2 \Bigr)^2 \Bigr\}^{1/2} \]

and is thus very nearly n² − ½(Σ n_{p.}² + Σ n_{.p}²), it will be seen that the denominator of G
is in practice likely to be smaller always. What is more, it is easily seen that G can
attain its limits ±1 if all the observations lie in a longest diagonal of the table. Thus
G is rather similar to t_c. Goodman and Kruskal (1963) give the standard error of G,
a method of computing it, and a simple upper bound for it which is estimated from

\[ \operatorname{var} G \le \frac{2}{n} \cdot \frac{1 - G^2}{D_G / n^2}, \tag{33.80} \]

where D_G is the denominator of G at (33.79). This compares with the upper bound
for the variance of t_c,

\[ \operatorname{var} t_c \le \frac{2}{n} \Bigl\{ \Bigl( \frac{m}{m-1} \Bigr)^2 - t_c^2 \Bigr\}. \tag{33.81} \]

Goodman and Kruskal (1963) show in two worked examples that G tends to be larger
than t_c, but that the upper bound for its standard error is considerably smaller; the
details are given in Exercise 33.11. If this is shown to be true in general, this fact,
together with the direct interpretability of G in terms of order-relationships in random
sampling (it gives the probability of the orders of x and y agreeing minus that of their
disagreeing, conditional upon there being no ties; cf. 33.12 for the 2 × 2 case) would
make it likely to become the standard measure of association for the ordered case.


Small-sample experiments by Rosenthal (1966) verify the applicability of the asymp- 
totic theory for G. 
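
As an illustration of (33.79), G may be computed for the data of Table 33.3 in two equivalent ways: from the denominator as written there, and from the numbers of concordant and discordant pairs. This is a sketch under our own naming (`gamma_g`, `concordant`, `discordant` are not the authors' notation); the value obtained is not quoted in the text above.

```python
# Table 33.3: rows are right-eye grades, columns are left-eye grades.
table = [
    [821, 112, 85, 35],
    [116, 494, 145, 27],
    [72, 151, 583, 87],
    [43, 34, 106, 331],
]


def gamma_g(t):
    """Return G of (33.79) with the concordant and discordant pair counts."""
    r, c = len(t), len(t[0])
    concordant = discordant = 0
    for i in range(r):
        for j in range(c):
            south_east = sum(t[k][l] for k in range(i + 1, r) for l in range(j + 1, c))
            south_west = sum(t[k][l] for k in range(i + 1, r) for l in range(j))
            concordant += t[i][j] * south_east
            discordant += t[i][j] * south_west
    n = sum(map(sum, t))
    numerator = 2 * (concordant - discordant)   # the common numerator of the t's
    d_g = (n * n
           - sum(sum(row) ** 2 for row in t)         # squared row totals
           - sum(sum(col) ** 2 for col in zip(*t))   # squared column totals
           + sum(f * f for row in t for f in row))   # squared cell frequencies
    return numerator / d_g, concordant, discordant


g, concordant, discordant = gamma_g(table)
print(round(g, 3), concordant, discordant)
```

The denominator of (33.79) is twice the number of untied pairs, so that G is also the difference of the concordant and discordant counts divided by their sum.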


Scoring methods with pre-assigned scores 

33.41 Returning to our general discussion of r × c tables, we now consider the
possibilities of imposing a metric on the categories in the table. If we assign numerical
scores to the categories for each variable, we bring the problems of measuring
interdependence and dependence back to the ordinary (grouped) bivariate table, which we
discussed at length in Chapter 26. Thus, we may calculate correlation and regression
coefficients in the ordinary way. The difficulty is to decide on the appropriate scoring
system to use. We have discussed this from the standpoint of rank tests in 31.21-4,
where we saw that different tests resulted from different sets of "conventional
numbers" (i.e. "scores" in our present terminology). Here, the difficulty is more acute,
as we are seeking a measure, and not merely a test of independence.

The simplest scoring system uses the sets of natural numbers 1, 2, ..., r and
1, 2, ..., c for row and column categories respectively. Alternatively, we could use
the sets of normal scores E(s, r) and E(s, c) discussed in 31.39. Example 33.9
illustrates the procedures.


Example 33.9 
Let us calculate the correlation coefficients, using the scoring systems of 33.41, for
the data of Example 33.8. For the natural numbers scoring, we assign scores 1, 2, 3, 4
to the categories from "highest" to "lowest" for left eye (x) and (because the table
here happens to be square) similarly for right eye (y). We find, with n = 3242,

Σx  = (1052 × 1) + (791 × 2) + (919 × 3) + (480 × 4) = 7311,
Σy  = (1053 × 1) + (782 × 2) + (893 × 3) + (514 × 4) = 7352,
Σx² = (1052 × 1²) + ... = 20,167,
Σy² = (1053 × 1²) + ... = 20,442,
Σxy = (821 × 1 × 1) + (112 × 1 × 2) + ... = 19,159.


Thus the correlation coefficient is, for natural number scores,

\[ r_1 = \frac{19,159 - (7311)(7352)/3242}{[\{20,167 - (7311)^2/3242\}\{20,442 - (7352)^2/3242\}]^{1/2}} = \frac{2580}{(3677 \times 3772)^{1/2}} = 0.69. \]

This is not very different from the values of the ranking measures t_b = 0.64, t_c = 0.63
found in Example 33.8, and we should expect this since the "natural numbers"
scoring system is closely related to the rank correlation coefficients, as we saw in
Chapter 31.

Suppose now that we use the normal scores

E(1, 4) = −1.029,
E(2, 4) = −0.297,
E(3, 4) = +0.297,
E(4, 4) = +1.029,

obtained from the Biometrika Tables, Table 28.
We now simplify the computations into the form

Σx  = 1.029 (480 − 1052) + 0.297 (919 − 791) = −550.6,
Σy  = 1.029 (514 − 1053) + 0.297 (893 − 782) = −521.7,
Σx² = (1.029)² (480 + 1052) + (0.297)² (919 + 791) = 1773,
Σy² = (1.029)² (514 + 1053) + (0.297)² (893 + 782) = 1807,
Σxy = (1.029)² (821 + 331 − 35 − 43) + (0.297)² (494 + 583 − 145 − 151)
      + (1.029)(0.297)(116 + 112 + 106 + 87 − 72 − 34 − 85 − 27)
    = 1268.


Thus the correlation coefficient for normal scores is

\[ r_2 = \frac{1268 - (550.6)(521.7)/3242}{[\{1773 - (550.6)^2/3242\}\{1807 - (521.7)^2/3242\}]^{1/2}} = \frac{1179}{(1680 \times 1723)^{1/2}} = 0.69, \]

exactly the same to two decimal places as we found for natural number scores. It
hardly seems worth the extra trouble of computation to use the normal scores, at least
when the number of categories is as small as 4.
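
The arithmetic of this example is easily repeated for any pre-assigned scores. A minimal sketch follows, in which the function name `score_correlation` is an assumption of ours and the normal scores are those quoted above from the Biometrika Tables.

```python
from math import sqrt

# Table 33.3: rows are right-eye grades (y), columns are left-eye grades (x).
table = [
    [821, 112, 85, 35],
    [116, 494, 145, 27],
    [72, 151, 583, 87],
    [43, 34, 106, 331],
]


def score_correlation(t, row_scores, col_scores):
    """Product-moment correlation of the grouped table under the given scores."""
    n = sum(map(sum, t))
    sx = sy = sxx = syy = sxy = 0.0
    for i, row in enumerate(t):
        for j, freq in enumerate(row):
            x, y = col_scores[j], row_scores[i]
            sx += freq * x
            sy += freq * y
            sxx += freq * x * x
            syy += freq * y * y
            sxy += freq * x * y
    return (sxy - sx * sy / n) / sqrt((sxx - sx * sx / n) * (syy - sy * sy / n))


natural = [1, 2, 3, 4]
normal = [-1.029, -0.297, 0.297, 1.029]   # E(s, 4), s = 1, ..., 4

r_natural = score_correlation(table, natural, natural)
r_normal = score_correlation(table, normal, normal)
print(round(r_natural, 2), round(r_normal, 2))
```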


33.42 If one were strictly trying to impose a normal metric upon the r × c table,
a more reasonable system would be to assign scores to the categories which correspond
to the proportions observed in sampling from a normal distribution. Thus, in Example
33.9, we should calculate the "cutting points" of a standardized normal distribution
which give relative frequencies

1052/3242, 791/3242, 919/3242, 480/3242,

and use as the "left eye" scores the means within these four sections of the normal
distribution.

We need not make the calculation for the moment, but it is clear that the set of
scores obtained will differ from the crude normal scores used in Example 33.9. We
return to this scoring system in 33.50 below.

We do not further pursue the study of scoring methods with pre-assigned scoring
systems, because it is clear that by putting "information" into the table in this
way, we are making distributional assumptions which may lead us astray if they are
incorrect. On the whole, we should generally prefer to avoid this by using the
rank order methods of 33.36-40. Yates (1948) first proposed the natural numbers
scoring system of 33.41, and E. J. Williams (1952) surveyed scoring methods generally.


Least Squares analysis in an r × c table


33.43 If we are investigating the dependence of one categorization (say, the rows)
upon the other, we can avoid imposing a metric upon the latter. Define a score x_i
(i = 1, 2, ..., c−1) for all columns but the last, such that x_i = +1 for each observation
in the ith column, and x_i = 0 otherwise. The set of (c−1) x-scores then identifies
the column for every observation, the cth column being indicated by all (c−1) scores
being zero. Now, if the row-categorization has (or has imposed upon it) a metric, an
ordinary LS analysis of its dependence upon the x-scores may be carried out by the
methods of Chapter 19, in which there will be a parameter θ_i for each x_i. No new
theoretical point arises.


The choice of "optimum" scores: canonical analysis

33.44 However, we may approach the problem of scoring the categories in a (not
necessarily ordered) r × c table from quite another viewpoint. We may ask: what
scores should be allotted to the categories in order to maximize the correlation coefficient
between the two variables? Surprisingly enough, it emerges that these "optimum"
scores are closely connected with the transformation of the frequencies in the table to
bivariate normal frequencies. We first prove a theorem, due to Lancaster (1957), for
ungrouped observations.

Let x and y be distributed in the bivariate normal form with correlation ρ. Let
x' = x'(x) and y' = y'(y) be new variables, functions respectively of x alone and y
alone, with E{(x')²} and E{(y')²} both finite. Then we may validly write

\[ x' = a_0 + a_1 H_1(x) + a_2 H_2(x) + \ldots, \tag{33.82} \]

where the H_r are the Tchebycheff-Hermite polynomials defined by (6.21), standardized
so that

\[ \int_{-\infty}^{\infty} H_r^2(x)\, \alpha(x)\, dx = 1. \]

\sum_{i=1}^{\infty} a_i^2 will be convergent. The correlation is unaffected by changes of origin or scale,
so we may write a_0 = 0, and hence

\[ x' = \sum_{i=1}^{\infty} a_i H_i(x), \qquad \sum_{i=1}^{\infty} a_i^2 = 1, \]

and similarly we may write

\[ y' = \sum_{i=1}^{\infty} b_i H_i(y), \qquad \sum_{i=1}^{\infty} b_i^2 = 1. \]

Now H_r(x) is, by 6.14, the coefficient of t^r/r! in exp(tx − ½t²). Since the expectation
of exp(tx − ½t² + uy − ½u²) equals exp(ρtu), we have

\[ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} H_r(x) H_s(y) f\, dx\, dy = \begin{cases} \rho^r, & r = s, \\ 0, & r \ne s, \end{cases} \tag{33.83} \]

where f is the bivariate normal frequency.




The variances of x' and y' are unity in virtue of the orthogonality of the H_r, and
hence their correlation is

\[ \operatorname{cov}(x', y') = \sum_{i=1}^{\infty} a_i b_i \rho^i. \tag{33.84} \]

Now this is less than |ρ| unless a_1² = b_1² = 1. The other a's and b's must then vanish.
Hence the maximum correlation between x' and y' is |ρ| and we have Lancaster's
theorem: if a bivariate distribution of (x, y) can be obtained from the bivariate normal
by separate transformations on x and y, the correlation in the transformed distribution
cannot in absolute value exceed ρ, that in the bivariate normal distribution.


33.45 Suppose now that we seek a second pair of such transforms of x and y
separately, say x'' and y''. If we require these to be standardized and uncorrelated
with the first pair (x', y'), the Tchebycheff-Hermite representation

\[ x'' = \sum_{i=1}^{\infty} c_i H_i(x), \qquad y'' = \sum_{i=1}^{\infty} d_i H_i(y), \]

together with the orthogonality laid down requires at once that c_1 = d_1 = 0. Thus
we obtain

\[ x'' = \sum_{i=2}^{\infty} c_i H_i(x), \qquad y'' = \sum_{i=2}^{\infty} d_i H_i(y), \]

and, as at (33.84),

\[ \operatorname{corr}(x'', y'') = \sum_{i=2}^{\infty} c_i d_i \rho^i, \tag{33.85} \]

which is maximized in absolute value only if c_2² = d_2² = 1 and all other c_i, d_i = 0, when
the correlation is ρ². We may proceed similarly to further pairs of variables (x''', y'''),
(x^{(4)}, y^{(4)}), etc., obtaining maximized correlations |ρ|³, ρ⁴, etc.

The transformed pairs of variables are known as the canonical variables. What we
have shown is that the rth pair of canonical variables has canonical correlation |ρ|^r.
Evidently, from our proof, the canonical variables themselves are simply the
Tchebycheff-Hermite forms in the bivariate normal variables (x, y), i.e.

\[ (x^{(r)}, y^{(r)}) = (H_r(x), H_r(y)). \tag{33.86} \]

Lancaster (1958) further extends this type of analysis.


33.46 The results of 33.44-5 apply to ungrouped bivariate population values. In
practice, when we have a sample in an ordered r x c table, there is no difficulty in making 
separate transformations of the variables to achieve marginal univariate normal distri- 
butions for them—this is essentially what we discussed in 33.42—but we should be 
fortunate if we found these separate transformations to result in a bivariate normal 
distribution of the frequencies in the body of the table. However, the theoretical 
implication of the result is clear: if we seek separate scoring systems for the two categor- 
ized variables such as to maximize their correlation, we are basically trying to produce 
a bivariate normal distribution by operation upon the margins of the table. 


33.47 Suppose, then, that we allot scores x_i, i = 1, 2, ..., r, and y_j, j = 1, 2, ..., c,
to the categories of an r × c table. Without loss of generality we may take them to be
in standard measure (zero mean, unit variance). Then we have

\[ \operatorname{var} x = \sum_i n_{i.} x_i^2 / n = \operatorname{var} y = \sum_j n_{.j} y_j^2 / n = 1, \tag{33.87} \]

\[ \operatorname{cov}(x, y) = \operatorname{corr}(x, y) = \sum_i \sum_j n_{ij} x_i y_j / n. \tag{33.88} \]

We require to maximize (33.88) for variation in x and y subject to (33.87). If λ, μ are
Lagrange undetermined multipliers, this leads to the equations

\[ \sum_j n_{ij} y_j - \lambda n_{i.} x_i = 0, \qquad i = 1, 2, \ldots, r, \tag{33.89} \]

\[ \sum_i n_{ij} x_i - \mu n_{.j} y_j = 0, \qquad j = 1, 2, \ldots, c. \tag{33.90} \]

Multiplying (33.89) by x_i and summing over i, we have, by (33.87) and (33.88),

\[ \lambda = R, \]

where R is the correlation we are seeking. Similarly we find μ = R, and hence, from
(33.89-90),

\[ \sum_j n_{ij} y_j = R\, n_{i.} x_i, \qquad i = 1, 2, \ldots, r, \]
\[ \sum_i n_{ij} x_i = R\, n_{.j} y_j, \qquad j = 1, 2, \ldots, c. \tag{33.91} \]

Eliminating x and y, we have a determinantal equation which may be written
symbolically

\[ \begin{vmatrix} R\, n_{i.} & n_{ij} \\ n_{ij} & R\, n_{.j} \end{vmatrix} = 0. \tag{33.92} \]

We shall study this in considering the theory of canonical correlations in Volume 3.
It is enough here to note that R can be expressed in terms of the cell frequencies. In
fact, (33.92) is an equation in R² with a number of roots. There are, in general,
m = min(r, c) non-zero roots, one of which is identically unity; we are interested
only in the m−1 others. These are the canonical correlations. That there can be
only m non-zero canonical correlations follows from the fact that the rank of the array
of frequencies {n_{ij}} is at most m. We require the largest root, R_1. The others, apart
from sampling effects, are powers of this largest, as we have seen in 33.45.


33.48 It follows from (33.92) that if the canonical correlations, the roots of (33.92),
are R_1, R_2, ..., R_{m−1}, then

\[ n_{ij} = \frac{n_{i.} n_{.j}}{n} \Bigl\{ 1 + \sum_{s=1}^{m-1} R_s x_{is} y_{js} \Bigr\}, \tag{33.93} \]

x_{is} and y_{js} being the scores for the sth pair of canonical variables.
In the limit, as the categories of the r × c table become finer and m → ∞, this reduces
because of 33.44-5 to the tetrachoric series

\[ f = (2\pi)^{-1} \exp\{ -\tfrac{1}{2}(x^2 + y^2) \} \Bigl\{ 1 + \sum_{j=1}^{\infty} \rho^j H_j(x) H_j(y) \Bigr\}, \tag{33.94} \]

where f is the bivariate normal frequency. (33.94) is simply another form of (26.66),
which differs only in the factor j! in its denominator, since the H_j were not there
standardized.


33.49 When the largest canonical correlation R_1 has been determined from (33.92),
we can immediately calculate the "optimum" sets of scores giving this correlation.
This is perhaps most easily done by returning to (33.91). If the second equation there
is multiplied by n_{ij}/{(n_{i.})^{1/2} n_{.j}} and summed over j, it becomes, on using the first
equation of (33.91),

\[ \sum_j \frac{n_{ij}}{n_{i.}^{1/2} n_{.j}} \sum_k n_{kj} x_k = \frac{R}{n_{i.}^{1/2}} \sum_j n_{ij} y_j = R^2 x_i n_{i.}^{1/2}. \tag{33.95} \]

(33.95) may be rewritten

\[ \sum_k \Bigl\{ \sum_j \frac{n_{ij} n_{kj}}{(n_{i.} n_{k.})^{1/2} n_{.j}} \Bigr\} (x_k n_{k.}^{1/2}) = R^2 (x_i n_{i.}^{1/2}), \tag{33.96} \]

which makes it clear that the squared canonical correlations are the latent roots of the
(r × r) matrix NN', where N is the (r × c) matrix whose elements are
n_{ij}/(n_{i.} n_{.j})^{1/2}. (33.96) in matrix terms is

\[ NN'u = R^2 u, \tag{33.97} \]

where u is the (r × 1) vector with elements x_i n_{i.}^{1/2}.
Since the rank of N, and hence of NN', is at most m = min(r, c), NN' will have
m non-zero latent roots in general as stated in 33.47. It is easily verified that R² = 1
is always a root, the latent vector u then having elements (n_{i.})^{1/2}, i.e. x_i = 1 for this
root. Leaving aside this root, which is irrelevant to the problem of association, we
see that once the largest latent root R_1² of (33.97) is determined, the corresponding
latent vector u_1 gives the vector of scores for the first canonical x-variable. Similarly,
the scores for the first canonical y-variable are given by the latent vector v_1 in

\[ N'Nv = R^2 v, \tag{33.98} \]

where v is the (c × 1) vector with elements y_j n_{.j}^{1/2}. The non-zero latent roots of the
(c × c) matrix N'N are, of course, the same as those of NN', namely the squared canonical
correlations. However, there is no need to solve both (33.97) and (33.98); we need
only solve one (it is naturally easier to choose the one corresponding to the smaller
of r and c, i.e. NN' if r < c) and then obtain the other set of scores from (33.91), which
we rewrite

\[ x_i = \frac{1}{R\, n_{i.}} \sum_j n_{ij} y_j, \qquad y_j = \frac{1}{R\, n_{.j}} \sum_i n_{ij} x_i. \tag{33.99} \]


Example 33.10 


Let us make a canonical analysis of the data of Example 33.8. We first rewrite
the table with the marginal frequencies replaced by their square roots:


      821           112            85            35      |  32.4499615
      116           494           145            27      |  27.9642629
       72           151           583            87      |  29.8831056
       43            34           106           331      |  22.6715681

  32.4345495    28.1247222    30.3150128    21.9089023


We now construct the matrix N by dividing the n_{ij} in the table by the product of the
corresponding marginal square roots, e.g. 821/(32.4345495 × 32.4499615). We
obtain:

\[ N = \begin{pmatrix}
0.780047593 & 0.122720070 & 0.086406614 & 0.049230386 \\
0.127892990 & 0.628109454 & 0.171043616 & 0.044069667 \\
0.074284618 & 0.179664791 & 0.643554112 & 0.132884065 \\
0.058476184 & 0.053322330 & 0.154229180 & 0.666385924
\end{pmatrix}. \tag{33.100} \]

Here we have r = c, so it is immaterial which of NN' and N'N we work with. We
shall compute

\[ NN' = \begin{pmatrix}
0.633424197 & 0.193793122 & 0.142143279 & 0.098290784 \\
0.193793122 & 0.442076157 & 0.238281615 & 0.096718276 \\
0.142143279 & 0.238281615 & 0.469617711 & 0.201730920 \\
0.098290784 & 0.096718276 & 0.201730920 & 0.474119575
\end{pmatrix}. \tag{33.101} \]

We recall that the sum of the latent roots is equal to the trace of the matrix, so that if
the trace of the matrix does not much exceed 1, the largest canonical correlation must
be small; this is a useful preliminary check. Here, the trace exceeds 2, so that R_1²
could be as large as 1.

We now obtain the latent roots. We must solve the characteristic equation

\[ |NN' - \lambda I| = 0. \tag{33.102} \]

If we subtract λ from each diagonal element in NN', and expand the determinant of
the resulting matrix, we find that it reduces to the quartic equation

λ⁴ − 2.019237640 λ³ + 1.343416989 λ² − 0.355747132 λ + 0.031567594 = 0.

Since one root of this equation must be unity, the left-hand side has a factor (λ−1),
and we may write it as

(λ−1)(λ³ − 1.019237640 λ² + 0.324179349 λ − 0.031567783) = 0.

We are thus left with a cubic equation, which is solved by standard methods. The
roots are the squared canonical correlations

R_1² = λ_1 = 0.48516,
R_2² = λ_2 = 0.34604,        (33.103)
R_3² = λ_3 = 0.18803.

It will be noticed that R_1 = 0.697 is not very much larger than the correlations of
0.69 obtained with natural number and normal scores in Example 33.9.
We now require the latent vectors corresponding to λ_1. We first solve the set of
equations for the elements of u_1,

NN'u_1 = 0.48516 u_1,

and find on dividing the elements u_i of u_1 by n_{i.}^{1/2} that the canonical scores for the
row-categories are

u_1/n_{1.}^{1/2} = −1.307
u_2/n_{2.}^{1/2} = +0.021        (33.104)
u_3/n_{3.}^{1/2} = +0.739
u_4/n_{4.}^{1/2} = +1.362

The canonical y-scores are obtained from (33.99). Thus, for example,

y_1 = {(821 × −1.307) + (116 × +0.021) + (72 × +0.739) + (43 × +1.362)} /
      {1052 (0.48516)^{1/2}}.

The set of scores is

y_1 = −1.309
y_2 = +0.040        (33.105)
y_3 = +0.730
y_4 = +1.406

The sets of scores (33.104) and (33.105), when weighted by the row or column
frequencies, have zero means but have had to be adjusted so that their variances are
unity, since latent vectors have an arbitrary scale constant.
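
The latent root and vector of this example may also be obtained iteratively rather than from the characteristic equation. The following is a minimal sketch and not the method used above: it substitutes power iteration for the solution of the cubic, first removing the known root of unity, whose latent vector has elements proportional to n_{i.}^{1/2}. All function and variable names are our own assumptions.

```python
from math import sqrt

# Table 33.3: rows are right-eye grades, columns are left-eye grades.
table = [
    [821, 112, 85, 35],
    [116, 494, 145, 27],
    [72, 151, 583, 87],
    [43, 34, 106, 331],
]
r, c = len(table), len(table[0])
n = sum(map(sum, table))
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# N has elements n_ij / (n_i. n_.j)^(1/2); NNt is the (r x r) matrix NN'.
N = [[table[i][j] / sqrt(row_totals[i] * col_totals[j]) for j in range(c)]
     for i in range(r)]
NNt = [[sum(N[i][k] * N[j][k] for k in range(c)) for j in range(r)]
       for i in range(r)]

# Unit latent vector belonging to the irrelevant root R^2 = 1.
u0 = [sqrt(v / n) for v in row_totals]


def deflate(v):
    """Remove from v its component along the latent vector of the unit root."""
    proj = sum(a * b for a, b in zip(v, u0))
    return [a - proj * b for a, b in zip(v, u0)]


def normalize(v):
    norm = sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


# Power iteration in the subspace orthogonal to u0 converges to u_1.
u = normalize(deflate([1.0, 0.5, -0.5, -1.0]))
for _ in range(500):
    w = [sum(NNt[i][j] * u[j] for j in range(r)) for i in range(r)]
    u = normalize(deflate(w))

# Rayleigh quotient gives the largest squared canonical correlation.
r1_sq = sum(u[i] * sum(NNt[i][j] * u[j] for j in range(r)) for i in range(r))
if u[0] > 0:  # fix the arbitrary sign so that the highest grade scores negatively
    u = [-a for a in u]

# Row scores in standard measure, then column scores from (33.99).
x = [u[i] * sqrt(n / row_totals[i]) for i in range(r)]
R1 = sqrt(r1_sq)
y = [sum(table[i][j] * x[i] for i in range(r)) / (R1 * col_totals[j])
     for j in range(c)]
print(round(r1_sq, 5), [round(v, 3) for v in x], [round(v, 3) for v in y])
```

Scaling the unit latent vector by (n/n_{i.})^{1/2} gives scores whose weighted variance is unity, which is the adjustment of scale referred to at the end of the example.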


33.50 We now recall the implication of Lancaster’s theorem discussed in 33.46: 
the choice of row- and column-scores to maximize the correlation between the categ- 
orized variables is essentially equivalent to transforming the margins of the table to 
univariate normality, with the intention of rendering the body of the table bivariate 
normal. Let us, therefore, apply to the data of Example 33.10 the normal scoring 
system outlined in 33.42. We shall then be able to see whether the resulting scores 
for the categories agree well with the canonical scoring in Example 33.10. 


Example 33.11 


The two sets of proportional marginal frequencies in Example 33.8, together with
the corresponding ranges of a standardized univariate normal distribution (obtained
from the Biometrika Tables) are:


  n_{i.}/n   Corresponding normal   |   n_{.j}/n   Corresponding normal
             range (a_i, b_i)       |              range (a_i, b_i)

  0.3245     (−∞, −0.4551)          |   0.3248     (−∞, −0.4543)
  0.2440     (−0.4551, +0.1726)     |   0.2412     (−0.4543, +0.1662)
  0.2835     (+0.1726, +1.0450)     |   0.2755     (+0.1662, +1.0006)        (33.106)
  0.1480     (+1.0450, ∞)           |   0.1585     (+1.0006, ∞)

  1.0000                            |   1.0000


The mean value within a range (a_i, b_i) of a standardized normal distribution containing
a fraction p_i of the distribution is

\[ \frac{1}{p_i} \int_{a_i}^{b_i} (2\pi)^{-1/2}\, t\, e^{-t^2/2}\, dt = \frac{(2\pi)^{-1/2}}{p_i} \bigl( e^{-a_i^2/2} - e^{-b_i^2/2} \bigr). \tag{33.107} \]

We neglect the factor (2π)^{-1/2}, since we are interested only in correlations, and scale
changes in scores do not affect this. The values of the scores are found, using 4-figure
logarithms, to be:


  Row scores    Column scores
    −2.778         −2.777
    −0.343         −0.350
    +1.432         +1.380
    +3.914         +3.825


Unlike the scores in Example 33.10, these are not exactly standardized, even if we
restore the neglected factor (2π)^{-1/2}, for the use of means within ranges of the
standardized normal distribution introduces a grouping approximation. We find, for the
sums and sums of squares of these scores (weighted, of course, by row and column
marginal frequencies respectively):


Row scores Column scores 
Sum Ea = .. +98 —94 
Sum of squares .. fee Fo. 16,984 
Mean == ae .. 0:030 —0-029 
Standard deviation aes 2°289 


Adjusting the scores by subtracting the mean and dividing by the standard deviation,
we obtain:


  Standardized row scores (x)    Standardized column scores (y)
          −1.194                          −1.205
          −0.159                          −0.140
          +0.596                          +0.616                    (33.108)
          +1.652                          +1.684

The scores (33.108) agree only rather roughly with those in (33.104-5). We may only 
regard the method of the present example as giving a crude approximation to the 
canonical scores. 
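
The cutting points and range means of (33.107) can be computed without tables. The sketch below is illustrative only: it uses `statistics.NormalDist` from the Python standard library for the inverse normal integral, and the function name `normal_range_scores` is our own. It is applied to the marginal totals 1052, 791, 919, 480 of Example 33.8.

```python
from math import exp, inf, isinf, pi, sqrt
from statistics import NormalDist


def normal_range_scores(totals):
    """Cutting points and within-range means of a standardized normal
    distribution divided in the proportions given by the marginal totals."""
    n = sum(totals)
    std = NormalDist()  # mean 0, standard deviation 1
    cuts = [-inf]
    cumulative = 0
    for v in totals[:-1]:
        cumulative += v
        cuts.append(std.inv_cdf(cumulative / n))
    cuts.append(inf)

    def density(t):
        # Ordinate of the standardized normal curve; zero at either infinity.
        return 0.0 if isinf(t) else exp(-0.5 * t * t) / sqrt(2 * pi)

    # Mean within (a, b) is {density(a) - density(b)} / p, as at (33.107).
    scores = [(density(cuts[k]) - density(cuts[k + 1])) / (v / n)
              for k, v in enumerate(totals)]
    return cuts, scores


totals = [1052, 791, 919, 480]
cuts, scores = normal_range_scores(totals)
# Multiplying by (2 pi)^(1/2) drops the factor neglected in the example.
book_scale = [s * sqrt(2 * pi) for s in scores]
weighted_mean = sum(v * s for v, s in zip(totals, scores)) / sum(totals)
print([round(a, 4) for a in cuts[1:-1]], [round(s, 3) for s in book_scale])
```

With the exact marginal proportions as weights the mean of the scores vanishes identically, since the ordinates in (33.107) cancel in succession when the ranges are summed.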


Partitions of X*: canonical components 


33.51 The canonical correlations discussed in 33.44-9 have a close relationship
with the $X^2$ statistic (33.62). Consider again the matrix $NN'$ defined in 33.49. Its
diagonal elements are

$$(NN')_{ii} = \sum_j \frac{n_{ij}^2}{n_{i.}\, n_{.j}},$$

and thus we have

$$\operatorname{tr} NN' = \sum_i \sum_j \frac{n_{ij}^2}{n_{i.}\, n_{.j}} = \frac{X^2}{n} + 1, \tag{33.109}$$

by (33.62). Remembering that the trace of a matrix is the sum of the latent roots, and
that the latent roots of $NN'$ are $1, R_1^2, \ldots, R_{r-1}^2$, we therefore have from (33.109)

$$X^2 = n(R_1^2 + R_2^2 + \ldots + R_{r-1}^2). \tag{33.110}$$

We thus display the squared canonical correlations, multiplied by $n$, as components or
partitions of $X^2$.

It is tempting to suppose that the components in (33.110) are themselves asymptotically
independent $\chi^2$ variables on the hypothesis of independence, the degrees of freedom
of $nR_s^2$ being $(r-s)(c-s)-(r-s-1)(c-s-1) = r+c-2s-1$. However, Lancaster (1963)
shows that this is not so.


Example 33.12
In Example 33.10, we have $n = 3242$, and the squared canonical correlations are
given by (33.103). Thus, by (33.110),

$$X^2 = 3242\,(0.48516 + 0.34604 + 0.18803) = 3{,}304,$$

as may be verified by direct calculation in Example 33.8.
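
As a numerical illustration of (33.109) and (33.110), the sketch below computes $X^2$ for a two-way table both directly and as $n(\operatorname{tr} NN' - 1)$, and repeats the arithmetic of Example 33.12. It is illustrative only: the helper names are assumptions, and the 3 × 3 frequencies are those quoted in Example 33.13 of this chapter, used here simply as a convenient table.

```python
def chi_square(table):
    """Ordinary X^2 for independence, margins estimated from the table."""
    n = sum(map(sum, table))
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return sum((x - ri * cj / n) ** 2 / (ri * cj / n)
               for row, ri in zip(table, rows)
               for x, cj in zip(row, cols))

def trace_nn(table):
    """tr NN' = sum over cells of n_ij^2 / (n_i. n_.j), as in (33.109)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return sum(x * x / (ri * cj)
               for row, ri in zip(table, rows)
               for x, cj in zip(row, cols))

table = [[3009, 2832, 3008],
         [3047, 3051, 2997],
         [2974, 3038, 3018]]
n = sum(map(sum, table))
x2_direct = chi_square(table)
x2_from_trace = n * (trace_nn(table) - 1.0)

# Example 33.12: n times the sum of the squared canonical correlations
x2_example = 3242 * (0.48516 + 0.34604 + 0.18803)
print(round(x2_direct, 3), round(x2_from_trace, 3), round(x2_example))
```

The two routes to $X^2$ agree apart from rounding error, which is all that (33.109) asserts; extracting the individual $R_s^2$ would additionally require the latent roots of $NN'$.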


33.52 There are many (indeed an infinite number of) other ways in which $X^2$
may be partitioned; the formal structure of such partitions was given in 30.44.
Whether a particular partitioning has statistical interest depends on the purpose of the
analysis. As a preliminary, it should be noticed that $X^2$ itself is, in fact, a component
of a larger such quantity, which we shall denote by $X_T^2$.

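We no longer restrict ourselves to ordered tables, but consider only the independence
case, when $p_{ij} = p_{i.}\,p_{.j}$ for all $i, j$. The probability of the observed frequencies $n_{ij}$ is
then

$$P\{n_{ij} \mid p_{ij}, n\} = \frac{n!}{\prod_{i,j} n_{ij}!}\,\prod_{i,j} (p_{i.}\,p_{.j})^{n_{ij}} \tag{33.111}$$

$$= \left\{\frac{n!}{\prod_i n_{i.}!}\prod_i p_{i.}^{\,n_{i.}}\right\}\left\{\frac{n!}{\prod_j n_{.j}!}\prod_j p_{.j}^{\,n_{.j}}\right\}\left\{\frac{\prod_i n_{i.}!\,\prod_j n_{.j}!}{n!\,\prod_{i,j} n_{ij}!}\right\}, \tag{33.112}$$

just as in 33.23 for the 2 × 2 table. The left-hand side of (33.112) and each of the
three factors on its right can be approximated by $\chi^2$ distributions. Writing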


$$n_{ij} = np_{ij} + \epsilon_{ij},$$

we find on using Stirling's series that the left-hand side of (33.112) is asymptotically given by

$$\log P\{n_{ij} \mid p_{ij}, n\} = \text{constant} - \tfrac{1}{2}\sum_{i,j}\{\epsilon_{ij}^2/(np_{ij})\}.$$

Thus

$$X_T^2 = \sum_{i,j} \frac{(n_{ij} - np_{ij})^2}{np_{ij}} \tag{33.113}$$

is asymptotically distributed like the sum of squares of $rc$ standardized normal variates
subject to one linear constraint ($\sum_{i,j} n_{ij} = n$) and is therefore a $\chi^2$ variate with $rc-1$
degrees of freedom.

Similarly,

$$X_R^2 = \sum_i \frac{(n_{i.} - np_{i.})^2}{np_{i.}} \tag{33.114}$$

is asymptotically a $\chi^2_{r-1}$,

$$X_C^2 = \sum_j \frac{(n_{.j} - np_{.j})^2}{np_{.j}} \tag{33.115}$$

is asymptotically a $\chi^2_{c-1}$, and we already know that the "ordinary" $X^2$, which we now
write

$$X_{RC}^2 = \sum_{i,j} \frac{(n_{ij} - n_{i.}n_{.j}/n)^2}{n_{i.}n_{.j}/n}, \tag{33.116}$$

is asymptotically a $\chi^2$ with $(r-1)(c-1)$ degrees of freedom. (33.113-16) give the
asymptotic partitioning

$$X_T^2 = X_R^2 + X_C^2 + X_{RC}^2, \tag{33.117}$$

which is, in fact, a way of reflecting the factorization (33.112). Degrees of freedom on
the right of (33.117) add to the degrees of freedom on its left, i.e.

$$rc-1 = (r-1)+(c-1)+(r-1)(c-1). \tag{33.118}$$
Thus we see (as we did in 33.29) that the degrees of freedom $(r-1)+(c-1)$ on the
right of (33.118) are lost to the "ordinary" $X^2$ (which is $X_{RC}^2$ in our present notation)
because we have to estimate row- and column-probabilities from the table; if these
were known a priori, (33.113) could be used instead. This is merely another instance
of the loss of degrees of freedom due to estimation of parameters, which we remarked
in 19.9.


Example 33.13 (Lancaster, 1949b)

A sampling experiment was (in effect) conducted nine times according to variations
of factor A (threefold) and factor B (threefold). The frequencies (which were
occurrences in sampling from Poisson populations) were as follows:

      3,009    2,832    3,008  |   8,849
      3,047    3,051    2,997  |   9,095
      2,974    3,038    3,018  |   9,030
     ------------------------------------
      9,030    8,921    9,023  |  26,974

This is one of the relatively infrequent cases where we have prior marginal probabilities.
Here $p_{ij} = \frac{1}{9}$ $(i, j = 1, 2, 3)$ and $p_{i.} = p_{.j} = \frac{1}{3}$. Using equation (33.117) we find:

  Component                 Value     Degrees of     Critical value
                                      freedom        α = 0.05
  X_R^2                      3.615        2             5.991
  X_C^2                      0.828        2             5.991
  X_T^2                     11.864        8            15.507
  X_T^2 − X_R^2 − X_C^2      7.421        4             9.488

None of the three values $X_R^2$, $X_C^2$, $X_T^2$ exceeds its random sampling limit given in the
last column. The conditions of experimentation seem to have been about constant.
Had we not possessed information about the marginal probabilities, we should have
had to estimate them. We then find $X_{RC}^2 = 7.547$ with 4 d.f. This is not quite the
same as the value of 7.421 for $X_T^2 - X_R^2 - X_C^2$ in the above table, but the difference is
trivial. It is, of course, due to the fact that the partition (33.117) is strictly an
asymptotic one.
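
The figures of this example can be reproduced with a few lines of code. The sketch below is illustrative only (the helper `goodness_of_fit` is a name introduced here): it evaluates (33.113)-(33.115) with the prior probabilities $\frac{1}{9}$ and $\frac{1}{3}$, obtains the remaining component by difference as in (33.117), and compares it with the ordinary $X^2$ of (33.116) in which the margins are estimated.

```python
def goodness_of_fit(observed, expected):
    """Sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

table = [[3009, 2832, 3008],
         [3047, 3051, 2997],
         [2974, 3038, 3018]]
cells = [x for row in table for x in row]
rows = [sum(r) for r in table]
cols = [sum(c) for c in zip(*table)]
n = sum(cells)

# Prior probabilities: p_ij = 1/9, p_i. = p_.j = 1/3
x2_total = goodness_of_fit(cells, [n / 9] * 9)        # (33.113), 8 d.f.
x2_rows = goodness_of_fit(rows, [n / 3] * 3)          # (33.114), 2 d.f.
x2_cols = goodness_of_fit(cols, [n / 3] * 3)          # (33.115), 2 d.f.
x2_by_difference = x2_total - x2_rows - x2_cols       # 4 d.f.

# Ordinary X^2 of (33.116), margins estimated from the table
estimated = [ri * cj / n for ri in rows for cj in cols]
x2_estimated = goodness_of_fit(cells, estimated)

print(round(x2_rows, 3), round(x2_cols, 3), round(x2_total, 3),
      round(x2_by_difference, 3), round(x2_estimated, 3))
```

The small gap between the component found by difference and the ordinary $X^2$ is the asymptotic discrepancy remarked on in the text.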


33.53 By essentially the method of 33.52, Lancaster (1949b) partitions $X^2$ for
an r × c table into $(r-1)(c-1)$ components, each having a single degree of freedom.
Each degree of freedom corresponds to $X^2$ for a particular 2 × 2 classification of the
table. We shall not give the details here, but the method is easily understood from
two examples. The 2 × 3 table

      n_11   n_12   n_13  |  n_1.
      n_21   n_22   n_23  |  n_2.
     ----------------------------
      n_.1   n_.2   n_.3  |  n

for which $X^2$ has 2 degrees of freedom, has the 2 × 2 component tables:

      n_11   n_12  |  n_11+n_12              (n_11+n_12)   n_13  |  n_1.
      n_21   n_22  |  n_21+n_22      and     (n_21+n_22)   n_23  |  n_2.        (33.119)
      n_.1   n_.2  |  n_.1+n_.2              (n_.1+n_.2)   n_.3  |  n

If $X^2$ is calculated for each of these 2 × 2 tables in the ordinary way, their sum will
approximately be the $X^2$ of the original 2 × 3 table. Similarly, for a 3 × 3 table, with
4 degrees of freedom, the four component 2 × 2 tables are:

      n_11          n_12         |  n_11+n_12
      n_21          n_22         |  n_21+n_22
      (n_11+n_21)   (n_12+n_22)  |  (n_11+n_12+n_21+n_22)

      (n_11+n_12)             n_13         |  n_1.
      (n_21+n_22)             n_23         |  n_2.
      (n_11+n_12+n_21+n_22)   (n_13+n_23)  |  n_1.+n_2.
                                                                               (33.120)
      (n_11+n_21)   (n_12+n_22)  |  (n_11+n_12+n_21+n_22)
      n_31          n_32         |  (n_31+n_32)
      n_.1          n_.2         |  n_.1+n_.2

      (n_11+n_12+n_21+n_22)   (n_13+n_23)  |  n_1.+n_2.
      (n_31+n_32)             n_33         |  n_3.
      (n_.1+n_.2)             n_.3         |  n


The procedure is quite general, but must be used with care as the partitioning is not
unique (since rows and columns may in general be permuted). The components are
only additive asymptotically, as in Example 33.13.


Lancaster (1949b, 1950) and Irwin (1949) give a method of partitioning $X^2$ exactly
into $(r-1)(c-1)$ components corresponding to 2 × 2 tables, but the approximate partition
is good enough for most practical purposes (cf. Exercise 33.15). A. W. Kimball (1954)
simplifies the computations for the exact partitioning.
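
The approximate partition of a 3 × 3 table into the four 2 × 2 tables of (33.120) can be sketched as below, applied to the frequencies of Example 33.13. This is an illustration of the approximate partition only, not of the exact method of Lancaster and Irwin; the function names are assumptions introduced here.

```python
def chi2_2x2(a, b, c, d):
    """Ordinary X^2 (1 d.f.) for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def lancaster_components_3x3(t):
    """X^2 for each of the four 2 x 2 tables of (33.120)."""
    (n11, n12, n13), (n21, n22, n23), (n31, n32, n33) = t
    return [
        chi2_2x2(n11, n12, n21, n22),
        chi2_2x2(n11 + n12, n13, n21 + n22, n23),
        chi2_2x2(n11 + n21, n12 + n22, n31, n32),
        chi2_2x2(n11 + n12 + n21 + n22, n13 + n23, n31 + n32, n33),
    ]

table = [[3009, 2832, 3008],
         [3047, 3051, 2997],
         [2974, 3038, 3018]]
components = lancaster_components_3x3(table)
print([round(c, 3) for c in components], round(sum(components), 3))
```

The components total about 7.57, against $X^2 = 7.547$ for the whole table, which illustrates that they are additive only asymptotically (these are the values quoted in Exercise 33.15).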
33.54 Other types of partitioning of $X^2$ are discussed by Cochran (1954) in a
review of methods for r × c tables (and, indeed, also for goodness-of-fit tests) which
includes a discussion of the problems of handling tables with small hypothetical
(independence) frequencies in some or many cells without destroying the $\chi^2$ approximation
to the distribution of $X^2$. His recommendations are that if only 1 cell out of 5 or more,
or 2 cells out of 10 or more, have hypothetical frequencies smaller than 5, a minimum
hypothetical frequency of 1 is allowable. If there are more such cells, a minimum
hypothetical frequency of 2 is usually adequate if there are fewer than 30 degrees of
freedom. For more than 30 degrees of freedom, the exact mean and variance of $X^2$ given by
Haldane (1939) should be used and $X^2$ taken to be asymptotically normal with these
moments.

For ordered tables, Quenouille (1948) in an unpublished paper gives partitions of
$X^2$ which extract linear, quadratic, etc., components.


2 × c tables: the binomial homogeneity test


33.55 A particular case of the r × c table which is of special interest is the 2 × c
table, where we are comparing c samples in respect of the possession or non-possession
of an attribute. The general formula (33.62) for $X^2$ reduces here to

$$X^2 = \sum_{i=1}^{2}\sum_{j=1}^{c} \frac{(n_{ij} - n_{i.}n_{.j}/n)^2}{n_{i.}n_{.j}/n}, \tag{33.121}$$

where $n_{.j}$ is the jth sample size and $n = \sum_j n_{.j}$ as before. Useful exact and approximate
methods of calculating (33.121) are given in Exercise 33.21. If we write

$$\hat{p} = n_{1.}/n$$

for the ML estimate from the table of the probability of observing a "success" (i.e.
an entry in the first row of the table), (33.121) may be expressed as

$$X^2 = \sum_{j=1}^{c} \frac{(n_{1j} - n_{.j}\hat{p})^2}{n_{.j}\hat{p}} + \sum_{j=1}^{c} \frac{\{n_{2j} - n_{.j}(1-\hat{p})\}^2}{n_{.j}(1-\hat{p})} = \sum_{j=1}^{c} \frac{(n_{1j} - n_{.j}\hat{p})^2}{n_{.j}\hat{p}(1-\hat{p})}, \tag{33.122}$$

distributed asymptotically as $\chi^2$ with $c-1$ degrees of freedom. The test of the
homogeneity of the c binomial samples based on (33.122) is thus seen essentially to be based
on the sum of squares of c independent binomial variables each measured from its
expectation (estimated on the hypothesis of homogeneity) and divided by its estimated
standard error $\{n_{.j}\hat{p}(1-\hat{p})\}^{1/2}$. There are $c-1$ degrees of freedom because we
estimate the expectation linearly from the data; if it were given independently of the
observations as $p$, we would replace $\hat{p}$ by $p$ in (33.122) and have the full c degrees of
freedom for $X^2$.

Armitage (1955) gives an expository account of tests for trend in the probabilities
underlying an ordered 2 × c table, which are essentially applications in this simpler
situation of the rank-order and scoring methods which are discussed generally earlier
in this chapter.
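
A short numerical check that the two forms (33.121) and (33.122) coincide is sketched below. It is illustrative only: the function names are assumptions, and the data are the first row and the column totals of the 2 × 11 table of Exercise 33.22, the second row being obtained by subtraction.

```python
def chi2_two_way(table):
    """X^2 of (33.121), summed over both rows of the 2 x c table."""
    n = sum(map(sum, table))
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    return sum((x - ri * cj / n) ** 2 / (ri * cj / n)
               for row, ri in zip(table, row_totals)
               for x, cj in zip(row, col_totals))

def chi2_binomial(successes, sizes):
    """X^2 in the binomial form (33.122), with p estimated by n_1./n."""
    p_hat = sum(successes) / sum(sizes)
    return sum((s - m * p_hat) ** 2 / (m * p_hat * (1 - p_hat))
               for s, m in zip(successes, sizes))

first_row = [25, 80, 38, 52, 9, 21, 33, 24, 30, 51, 56]
sample_sizes = [26, 95, 50, 60, 9, 28, 39, 26, 37, 58, 59]
second_row = [m - s for s, m in zip(first_row, sample_sizes)]

x2_general = chi2_two_way([first_row, second_row])
x2_binomial = chi2_binomial(first_row, sample_sizes)
degrees_of_freedom = len(sample_sizes) - 1
print(round(x2_general, 3), round(x2_binomial, 3), degrees_of_freedom)
```

Both forms give the same value, close to the exact figure of 16.709 quoted in Exercise 33.22, on $c - 1 = 10$ degrees of freedom.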
The Poisson homogeneity test


33.56 Now consider what happens to the statistic (33.122) when the hypothetical
underlying binomial distribution tends to the Poisson distribution in the classical
manner of 5.8, so that

$$n_{.j} \to \infty, \quad p \to 0, \quad n_{.j}\,p \to \lambda.$$

We then have c independent observations on this Poisson variable, namely the $n_{1j}$.

The statistic (33.122) with $\hat{p}$ replaced by $p$ reduces to

$$X^2 = \sum_{j=1}^{c} \frac{(n_{1j} - \lambda)^2}{\lambda}, \tag{33.123}$$

which is asymptotically a $\chi^2$ variable with c degrees of freedom as in 33.55. If, as is
usual, $\lambda$ must be estimated, we use the complete sufficient unbiassed estimator $n_{1.}/c = \bar{n}_1$,
which is the mean of the c observations. Thus

$$X^2 = \sum_{j=1}^{c} \frac{(n_{1j} - \bar{n}_1)^2}{\bar{n}_1} \tag{33.124}$$

has $(c-1)$ degrees of freedom as a test of homogeneity of c Poisson frequencies, a
degree of freedom having been lost by the estimation process just as for (33.122).
The tests of this and the last section, which are due to R. A. Fisher, are sometimes
called the dispersion tests of the binomial and Poisson distributions. This is
because each is the sum of c terms, each term being the ratio of a variate
squared about its estimated expectation to an estimate of its variance; in the case of
(33.124), the Poisson population mean and variance are equal, so that $\bar{n}_1$ estimates both.
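
The dispersion statistic (33.124) is simple enough to state as code. The sketch below is an illustration only (the function name `poisson_dispersion` is an assumption); it is applied to the nine frequencies of Example 33.13, which were described there as occurrences in sampling from Poisson populations.

```python
def poisson_dispersion(counts):
    """Dispersion statistic (33.124) and its degrees of freedom."""
    mean = sum(counts) / len(counts)   # estimates both mean and variance
    x2 = sum((x - mean) ** 2 for x in counts) / mean
    return x2, len(counts) - 1

counts = [3009, 2832, 3008, 3047, 3051, 2997, 2974, 3038, 3018]
x2, df = poisson_dispersion(counts)
print(round(x2, 3), df)
```

For these data the statistic reproduces the total of 11.864 on 8 degrees of freedom found in Example 33.13, because the estimated mean $n/9$ coincides with the hypothetical cell expectation used there.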


33.57 Cochran (1954) gives a detailed account and bibliography of the binomial
and Poisson dispersion tests, and especially of the partitioning of degrees of freedom
from $X^2$ in each case.

It appears, in particular, that the dispersion statistic (33.124) often gives a more
powerful test of the hypothesis that a sample originated from a Poisson distribution
than does the $X^2$ goodness-of-fit test based on grouping the observations into the
frequencies with which the values 0, 1, 2, ... are observed. The basic reason for this is
that for Poisson distributions with small values of the parameter $\lambda$, the observed
frequencies fall off sharply after a certain value, which is as low as 4 or 5 if $\lambda$ is 1 or less (cf.
Table 5.3 and Example 19.11). Thus, unless $n$ is extremely large, a goodness-of-fit
test can only have a few degrees of freedom (about 5) since the values in the upper
tail must be pooled into a single class to obtain a sufficiently large hypothetical
frequency for the test to be valid (cf. 30.30). This does not apply to the dispersion test,
where the number of degrees of freedom is equal to $c-1$, one less than the number of
observations, no grouping being necessary. This point is perhaps obscured by our
derivation of the test through the 2 × c table, but is clear from (33.124) directly. Thus,
for "reasonable" sample sizes, the dispersion test may be expected to be more
powerful.


Potthoff and Whittinghill (1966a, b) give other statistics for testing binomial,
multinomial and Poisson homogeneity.
Armitage (1966) considers the moments of the $X^2$ dispersion tests when the data are
sampled from a population stratified into sub-populations.


Multi-way tables 


33.58 Except in 33.9-10, we have been considering the relationships between
two categorized variables; our r × c tables have been two-way tables. It is natural
to generalize the problem to $p \ge 3$ variables categorized in multi-way tables, or, as they
are sometimes called, complex contingency tables. This was first done by K. Pearson
(1904, 1916) for an underlying multinormal distribution. If the pth variable is
polytomized into $r_p$ categories, we have an $r_1 \times r_2 \times \ldots \times r_p$ table, which can only be
physically represented in p dimensions. In the simplest case when $p = 3$, we can represent
the $r_1 \times r_2 \times r_3$ table as a solid with cells arrayed in rows, columns and "layers," and
to avoid subscripts we shall use the initial letters as in the two-way case and call this
an r × c × l table. In point of fact, the three-way table is the only multivariate one
which has received more than formal attention in the literature, since no new theoretical
points arise when $p > 3$; but we shall see that the generalization from two to three
dimensions does introduce new considerations.


B. N. Lewis (1962) gives an extended review of the subject. 


33.59 Let us first consider the approach of 33.52, where we partitioned the
two-way r × c table in the case of independence. If we write $n_{ijk}$ for the observed frequency
and $p_{ijk}$ for the probability in the cell in the ith row, jth column and kth layer, the
hypothesis of complete independence is

$$H_0: p_{ijk} = p_{i..}\, p_{.j.}\, p_{..k}, \tag{33.125}$$

where a dot denotes summation over that subscript as before. In the two-way case,
we had the partition (33.117-18) into "rows," "columns" and "rows × columns"
components, with $(r-1)$, $(c-1)$ and $(r-1)(c-1)$ degrees of freedom respectively. In
the present three-way case, we have the asymptotically additive components:

  Component                              Degrees of freedom

  Rows                      X_R^2        r−1
  Columns                   X_C^2        c−1
  Layers                    X_L^2        l−1
  Rows × columns            X_RC^2       (r−1)(c−1)                    (33.126)
  Rows × layers             X_RL^2       (r−1)(l−1)
  Columns × layers          X_CL^2       (c−1)(l−1)
  Rows × columns × layers   X_RCL^2      (r−1)(c−1)(l−1)

  TOTALS                    X_T^2        rcl−1

In the 2 × 2 × 2 table, each component in (33.126) has 1 degree of freedom.
If we regard the r × c × l table as a parallelepiped, the variation is thus expressed
first of all in terms of edges, secondly in terms of faces, and finally in terms of the main
body of the table.
The individual components in (33.126) are easily calculated. If there are
hypothetical probabilities for any or all of the $p_{i..}$, $p_{.j.}$, $p_{..k}$, the corresponding components
($X_R^2$, $X_C^2$ and $X_L^2$) are simply the goodness-of-fit $X^2$ values for the row, column and
layer marginal distributions, taken separately. If there are no hypothetical
probabilities in any case, the corresponding one of these components is identically zero.
We now compute the "ordinary" $X^2$ for testing independence in each of the three
two-way tables. From the (r × c) table $X^2$, we subtract ($X_R^2 + X_C^2$); from the (r × l)
table $X^2$, we subtract ($X_R^2 + X_L^2$); and from the (c × l) table $X^2$, we subtract ($X_C^2 + X_L^2$).
The results are $X_{RC}^2$, $X_{RL}^2$ and $X_{CL}^2$ respectively. Finally, we compute the $X^2$ for
testing independence in the (r × c × l) table. This is $X_T^2$, and $X_{RCL}^2$ is obtained by
differencing.


Example 33.14 (Lancaster (1951), quoting data of Roberts et al.)

The following show some data for rats in a 2 × 2 × 2 table classified according to
whether they do or do not possess attributes A, B, D. As before, we use α, β, δ to
denote absence of the attributes.

The basic frequencies are:

      (ABD) = 475      (αBD) = 467
      (ABδ) = 460      (αBδ) = 440
      (AβD) = 462      (αβD) = 494
      (Aβδ) = 509      (αβδ) = 427

We arrange these in three 2 × 2 tables thus:

  Attributes     α      A    Totals
      β         921    971    1892
      B         907    935    1842
    Totals     1828   1906    3734

  Attributes     δ      D    Totals
      β         936    956    1892
      B         900    942    1842
    Totals     1836   1898    3734

  Attributes     δ      D    Totals
      α         867    961    1828
      A         969    937    1906
    Totals     1836   1898    3734

The hypothetical probabilities of all the attributes are ½. Thus for A we have

$$X_A^2 = \frac{(1828 - 3734/2)^2}{1867} + \frac{(1906 - 3734/2)^2}{1867} = 1.6294. \tag{33.127}$$

Similarly, we find for the other components the values in the third column of the
following table:

  Component    Degrees of    X² (prior hypothetical    X² (parameters
               freedom       probabilities)            estimated)
     A            1             1.6294                    0
     B            1             0.6695                    0
     D            1             1.0295                    0
     AB           1             0.1296                    0.1176          (33.128)
     AD           1             4.2517                    4.3426
     BD           1             0.1296                    0.1397
     ABD          1             2.7863                    2.6904
   TOTALS         7            10.6256                    7.2904

For one degree of freedom, the 95 per cent point of $\chi^2$ is 3.84 and the 97½ per cent
point is 5.02. The only component near these values is the AD term, which lies between
them. If there is any connexion between the factors at all, therefore, one would look
for it between A and D, but the hypothesis of independence is not very strongly suspect.
In fact, we find $V_{AD} = (X_{AD}^2/n)^{1/2} = 0.03$. In any case, the component ABD is within
sampling limits, and if A and D were connected, we should expect the ABD component
to be large. Furthermore, we must bear in mind, as always when partitioning
$X^2$, that the separation of a single test into a number (here 7) increases the probability
of some component falling outside its random sampling limits. On the whole, therefore,
the conclusion seems to be that all three factors are independent (or so weakly
dependent that there is no decisive indication of interdependence).

Had we not had prior probabilities but estimated them by marginal frequencies,
we should have obtained the values of $X^2$ in the last column of (33.128). The values
are very close to the previous ones, as they should be, and the same conclusion is
reached.

It may be noted that we might have had prior information about some of the
probabilities but not of others. In such a case, we should estimate those unknown and
proceed as before.
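
The prior-probability column of (33.128) can be reproduced by the differencing procedure of 33.59. The following is a minimal sketch, not the original computation: the dictionary layout and the helper names `margin` and `fit` are assumptions introduced here, with each cell keyed by (A present, B present, D present).

```python
# Cell frequencies keyed by (A, B, D); 1 = attribute present, 0 = absent
cells = {
    (1, 1, 1): 475, (0, 1, 1): 467,
    (1, 1, 0): 460, (0, 1, 0): 440,
    (1, 0, 1): 462, (0, 0, 1): 494,
    (1, 0, 0): 509, (0, 0, 0): 427,
}
n = sum(cells.values())

def margin(axes):
    """Frequencies of the marginal table over the given attribute axes."""
    totals = {}
    for key, count in cells.items():
        sub = tuple(key[a] for a in axes)
        totals[sub] = totals.get(sub, 0) + count
    return totals

def fit(axes):
    """Goodness-of-fit X^2 of a marginal table against equal probabilities."""
    expected = n / 2 ** len(axes)
    return sum((o - expected) ** 2 / expected for o in margin(axes).values())

x2 = {name: fit((axis,)) for axis, name in enumerate("ABD")}
x2["AB"] = fit((0, 1)) - x2["A"] - x2["B"]
x2["AD"] = fit((0, 2)) - x2["A"] - x2["D"]
x2["BD"] = fit((1, 2)) - x2["B"] - x2["D"]
x2_total = fit((0, 1, 2))
x2["ABD"] = x2_total - sum(x2.values())   # three-factor term by differencing

for name, value in x2.items():
    print(name, round(value, 4))
print("TOTAL", round(x2_total, 4))
```

Each two-factor component is the goodness-of-fit value of the corresponding 2 × 2 margin less the two single-factor components, and the three-factor component is what remains of the total.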


33.60 The nature of the general multi-way table makes it possible to consider
a large number of hypotheses other than that of complete independence, stated at
(33.125). S. N. Roy and Mitra (1956), who make similar distinctions regarding the
structure of multi-way tables as we did for 2 × 2 and r × c tables in 33.18 and 33.31
above, develop large-sample $X^2$ tests for a number of these. For example, we may
wish to test

$$H_1: \frac{p_{ijk}}{p_{..k}} = \frac{p_{i.k}}{p_{..k}} \cdot \frac{p_{.jk}}{p_{..k}}, \tag{33.129}$$

which states that in a layer of a three-way table, the row and column variables are
independent. This is the analogue of a zero partial correlation between rows and
columns with layers fixed. Or we may wish to test

$$H_2: p_{ijk} = p_{ij.}\, p_{..k}, \tag{33.130}$$

which asserts the independence of the row-column classification, considered as a
bivariate distribution, from the layers. This is the analogue of a zero multiple
correlation of layers upon rows and columns.

By summing the two sides of (33.130) first over $j$ and then over $i$, we see that it
implies both

$$p_{i.k} = p_{i..}\, p_{..k} \tag{33.131}$$

and

$$p_{.jk} = p_{.j.}\, p_{..k}. \tag{33.132}$$

However, (33.131-2) do not alone imply (33.130). S. N. Roy and Kastenbaum (1956)
have investigated what additional hypothesis was necessary to ensure that (33.131-2)
lead to (33.130). They rejected because of its mathematical intractability the natural

$$H_3: p_{ijk} = \frac{p_{ij.}\, p_{i.k}\, p_{.jk}}{p_{i..}\, p_{.j.}\, p_{..k}} \tag{33.133}$$

and instead suggested

$$H_4: p_{ijk} = a_{ij.}\, a_{.jk}\, a_{i.k}, \tag{33.134}$$

where the $a$'s are arbitrary positive numbers. They show that (33.134) and (33.131-2)
imply (33.130). (33.133) may be otherwise expressed as in Exercise 33.31.


33.61 In accordance with the terminology of the Analysis of Variance (Volume 3),
(33.134) is the hypothesis that the second-order interaction in the table is zero. This
problem was first considered for the 2 × 2 × 2 table by Bartlett (1935b) and for the
2 × 2 × l table by Norton (1945). Lancaster (1951) proposed an alternative method
based on the component $X_{RCL}^2$ in (33.126), whose interpretation as a test of second-order
interaction is critically discussed by Plackett (1962); see also an illuminating discussion
of the analogies with Analysis of Variance by Darroch (1962). Plackett proposed
another test which is simplified by Goodman (1963b), who generalizes (1964c) the
method to interactions of any order, discusses (1963a) the 2 × 2 × l case, gives (1964a)
other methods based on cross-ratios, and gives (1964d) simple methods of testing and
obtaining confidence intervals for second-order interactions.


Lindley (1964) gives a related Bayesian analysis of contingency tables. 


33.62 Birch (1963) considers ML estimation of parameters in multi-way tables.
He also (1964c) discusses a test for the existence of partial association in a 2 × 2 × l
table, due essentially to Cochran (1954), and based on the statistic $n_{11.}$, approximately
normal with mean and variance obtained by summing those in (33.51) over the l layers.
If $\theta_k = \log\dfrac{p_{11k}\,p_{22k}}{p_{12k}\,p_{21k}}$, $H_0$ is that all $\theta_k = 0$, and the test is UMPU against $H_1$: all $\theta_k$ equal
and positive. Estimation of the assumed common value of $\theta_k$ is also discussed. Testing
that the $\theta_k$ are equal is equivalent to testing the second-order interaction (cf. 33.61).
The theory is generalized to r × c × l tables by Birch (1965).

Lancaster (1960) has extended the ideas of canonical analysis to the multi-way table.


EXERCISES 


33.1 Show that the coefficient of association Q is greater in absolute value than the 
coefficient of colligation Y, except when both are zero or unity in absolute value. 


33.2 Derive the standard error of the coefficient V given at (33.17). 
33.3 For the 3 × 4 table (from Ammon, Zur Anthropologie der Badener)

  Eye colour              Hair colour group
  group            B_1     B_2     B_3     B_4  |  Totals

   A_1            1768     807     189      47  |   2811
   A_2             946    1387     746      53  |   3132
   A_3             115     438     288      16  |    857

  Totals          2829    2632    1223     116  |   6800 = n

show that $X^2 = 1075.2$, and hence that the coefficients (33.63-5) are

      P = 0.3695,
      T = 0.2541,
      C = 0.2812.


33.4 Show that for an r × c table the Pearson coefficient of contingency P is equal
to the Tschuprow coefficient T for two values of $X^2/n$, one of which is zero; that for
$X^2/n$ between these values P > T, and for $X^2/n$ greater than the higher value T > P.


33.5 In experiments on the immunization of cattle from tuberculosis, the following
results were secured:

Table 33.4: Data from Report on the Spahlinger Experiments in Northern
Ireland, 1931-1934 (H.M. Stationery Office, 1935)

                                     Died of tuberculosis    Unaffected or only
                                     or very seriously       slightly affected     TOTALS
                                     affected
  Inoculated with vaccine                    6                     13                19
  Not inoculated or inoculated with
  control media                              8                      3                11
  TOTALS                                    14                     16                30

Show that for this table, on the hypothesis that inoculation and susceptibility to
tuberculosis are independent, $X^2 = 4.75$, so that the hypothesis is rejected for α > 0.029;
that with a correction for continuity the corresponding value of α is 0.072; and that by
the exact method of 33.19-20, α = 0.070.


33.6 Show that if two rows or two columns of an r × c table are amalgamated,
$X^2$ for testing independence in the new table cannot be greater than $X^2$ for the original
table, and in general will be less.


33.7 Show that if $f$ is a standardized p-variate normal distribution with dispersion
matrix $V$ and marginal distributions $f_1, f_2, \ldots, f_p$,

$$\phi^2 = \int\cdots\int \frac{(f - f_1 f_2 \cdots f_p)^2}{f_1 f_2 \cdots f_p}\, dx_1 \cdots dx_p = \frac{1}{|V|^{1/2}\,|W|^{1/2}} - 1,$$

where $W = 2I - V$.
(K. Pearson, 1904)
33.8 In a multi-way table based on classification of a standardized p-variate normal
distribution according to variates with correlations $\rho_{ij}$, show that

$$\log(1+\phi^2) = -\tfrac{1}{2}\log|I+P| - \tfrac{1}{2}\log|I-P|,$$

where $\phi^2$ is defined in Exercise 33.7 and $P$ is the matrix with elements $\rho_{ij}$, $i \ne j$, and 0,
$i = j$. Hence, by expanding, show that

$$\phi^2 \ge \tfrac{1}{2}\operatorname{tr} P^2 = \sum_{i<j} \rho_{ij}^2.$$

(Lancaster, 1957)


33.9 For the r xc table with both sets of marginal frequencies fixed, consider the 


statistic 
H r c ni ni 2 r Ni. c D2. 
>= a nNij — i 8.2. = jp hse ij 
n i=1 j=1 n i=1 NM ja1Mi,ny/n 


which is a weighted sum of the contributions of the rows of the table to X? at (33.62), 
the weights being the proportional row frequencies m:,/n. Show that on the hypothesis of 
independence 


E(H) = (c—1)n(n?— Xnj)/(n—1), 


4{(Ln})*—n Dink} 
var 7 = ng ee (n-2)-4(e-1) 0-0} 
ce (n*—Ln})(Unz—n) 40+6) iS 
Ee a x7 4 7 


GIG Is) i 2) @—D) 


<{mer— (<-147- BT) tenth 
7 e 


n? 
If all row marginal totals are equal, so that 7, = n/r, show that H = Xe, and that 


n® 
E (1) = aay 


= an (nF) Fe 2 = 11: is 
var H = ei ae oe (n 1) (« I)+— ET) te 1) \ 
so that $E(X^2) \to (r-1)(c-1)$,

$\operatorname{var}(X^2) \to 2(r-1)(c-1)$,

as they must since its limiting distribution is of the $\chi^2$ form with $(r-1)(c-1)$ degrees
of freedom.


(C. A. B. Smith (1951-2); cf. also Haldane (1939)) 
33.10 Show as in Example 33.8 that for Table 33.5 $\sum_{i,j} a_{ij}b_{ij} = 2 \times 13{,}264{,}256$ and
hence show that

      t_b = 0.658,
      t_c = 0.633.

Table 33.5: 7477 women aged 30-39 employed in U.K. Royal Ordnance
factories 1943-6: unaided distance vision

                               Left eye
  Right eye          Highest   Second    Third    Lowest
                     grade     grade     grade    grade     TOTALS
  Highest grade       1520       266      124       66       1976
  Second grade         234      1512      432       78       2256
  Third grade          117       362     1772      205       2456
  Lowest grade          36        82      179      492        789
  TOTALS              1907      2222     2507      841       7477


33.11 In the 4 × 4 tables of Example 33.8 and Exercise 33.10, show that the coefficient
G defined by (33.79) takes the values 0.776, 0.798 respectively, with maximum standard
errors given by (33.80) as 0.022, 0.014 respectively. Show also that the maximum
standard errors obtained for the $t_c$ values of 0.629, 0.633 are 0.029, 0.019 respectively.
(Goodman and Kruskal, 1963)


33.12 The following data, due to D. Chapman, relate the conditions under which
homework was carried out (rated from the best, A_1, to the worst, A_5) and the teacher's
assessment of the quality of the work (from best, B_1, to worst, B_3):

            A_1    A_2    A_3    A_4    A_5  |  Totals

  B_1       141     67    114     79     39  |    440
  B_2       131     66    143     72     35  |    447
  B_3        36     14     38     28     16  |    132

  Totals    308    147    295    179     90  |   1019

Show by assigning natural-number scores to the categories that the regression coefficient
of homework quality rating upon homework conditions (0.025 in these units) is within
ordinary sampling fluctuation limits of zero, its standard error being 0.016.
(Yates, 1948)


33.13 The table below, due to A. R. Treloar, relates the periodontal condition of
135 women to their average daily calcium intake:

                            Average grams of calcium per day
                     0-0.40    0.40-0.55    0.55-0.70    over 0.70
               A        5          3           10           11
  Periodontal  B        4          5            8            6
  condition    C       26         11            3            6
               D       23         11            1            2

Show that the canonical correlations are

      R_1 = 0.56273,                 nR_1^2 = 42.74984,
      R_2 = 0.10869,    so that      nR_2^2 =  1.59497,
      R_3 = 0.00045,                 nR_3^2 =  0.00003,

                                     X^2    = 44.34484,

only the first of these components of $X^2$ exceeding conventional critical values for $\chi^2$
distributions. Show that the canonical scores corresponding to $R_1$ are:

  Periodontal condition          Calcium intake
      A:  −1.3880                0-0.40:      0.8397
      B:  −1.0571                0.40-0.55:   0.4819
      C:   0.6016                0.55-0.70:  −1.5779
      D:   0.9971                over 0.70:  −1.1378

In particular, note the change of trend in the calcium intake scores at 0.70, which
confirms the impression from the data that there seems to be a limit above which increased
calcium intake does not further improve periodontal condition.

(E. J. Williams, 1952)


33.14 In Example 33.10, show that

          0.633768533   0.192522708   0.146101456   0.092877193
  N'N  =  0.192522708   0.444704410   0.241885812   0.093129969
          0.146101456   0.241885812   0.474670557   0.200085907
          0.092877193   0.093129969   0.200085907   0.466094141

and show that its four latent roots are given by unity and (33.103). Use (33.98) to
evaluate the vector of y-scores (33.105) directly, and thence obtain the x-scores (33.104)
by use of (33.99).


33.15 For the data of Example 33.13, show that the four 2 × 2 tables (33.120) yield
components
      2.860,  2.180,
      2.526,  0.005,

totalling 7.571, as against $X^2 = 7.547$ for the original 3 × 3 table.
(Lancaster, 1949b)


33.16 In 33.53, use the method of 33.52 to demonstrate the asymptotic partition 
of X? for the 2 x 3 table into single (2 x 2) components (33.119), and show that the 


argument may be extended to the general r x c table. 
(Lancaster, 1949b) 


33.17. In the multinomial (p;+p2.+ ..-. + pr)”, define 
= ni —npi 
— {npi(1—pi) 
Show that, in the notation of partial correlations and regressions, 
pig = — {pip;/(1—pi) (1—p) #, 
pijat...m = —(pids/L{A—pi)— (Pet bit «.. +Pm)}{—pi)— Ont bit ..- +Pm)}), 
of = 1, of, = {1—(pit+py)}/ {1 —ps) 1 —-D)}, 

Bijat...m = — {pipi1—pj)/A—p—a) #/{1—pilbet bit «-- +Pm)}- 
Show that w/o, We.1/02.1, 3.21/93. 21, etc., are asymptotically normally distributed with 
zero means, unit variances and zero correlations. Show further that wx,12..(n-1) = 0 and 


hence that the first (k—1) w’s provide a partition of X? into (k—1) independent com- 
ponents. 


Wi SS eS Sees SF 


(Lancaster, 1949b) 
33.18 Let C and D be orthogonal matrices of rank r and c respectively and let the
(rc × rc) matrix K be their direct product C × D and the variables

$$x_{ij} = (n_{ij} - np_{i.}\,p_{.j})/(np_{i.}\,p_{.j})^{1/2}$$

be arranged as a column vector

$$x = (x_{11}, x_{12}, \ldots, x_{1c}, x_{21}, x_{22}, \ldots, x_{2c}, \ldots, x_{r1}, x_{r2}, \ldots, x_{rc})'.$$

Show that by a suitable choice of the elements of C and D, the vector $y = Kx$ gives the
components of $X^2$ for the r × c contingency table, $y_{11}$ referring to the total, $y_{1k}$ ($k \ne 1$) the
column totals, $y_{k1}$ ($k \ne 1$) the row totals and the remaining terms the other $(r-1)(c-1)$
degrees of freedom (cf. 33.52).

(Lancaster, 1951)


33.19 In the particular case of a 3 × 3 table in Exercise 33.18, take

$$C = D = \begin{pmatrix} 1/\sqrt{3} & 1/\sqrt{3} & 1/\sqrt{3} \\ 1/\sqrt{2} & -1/\sqrt{2} & 0 \\ 1/\sqrt{6} & 1/\sqrt{6} & -2/\sqrt{6} \end{pmatrix}.$$

For the data of Example 33.13, obtain the matrix of y-values

$$\begin{pmatrix} 0 & -0.812829 & -0.409012 \\ -1.834459 & 1.653094 & 1.471167 \\ -0.499424 & -1.587173 & -0.070020 \end{pmatrix}$$

and hence verify the table of that example.


33.20 Show that for a 2 × c table with $n_{1.} = n_{2.}$, $X^2$ reduces to the form

$$X^2 = \sum_{j=1}^{c} \frac{(n_{1j} - n_{2j})^2}{n_{1j} + n_{2j}}.$$

This is the test statistic for testing the homogeneity of two equal-sized samples polytomized
into c categories.


33.21 In the 2 × c table, suppose that $n_{1.} > n_{2.}$, and choose small integers k, h with
$k > h \ge 1$ such that $k/h$ approximates $n_{1.}/n_{2.}$ and hence $(hn_{1.} - kn_{2.})/n$ is small. Show
that for the table, (33.121) may be written exactly as

$$X^2 = \frac{n^2}{(h+k)^2\, n_{1.}\, n_{2.}}\left\{\sum_{j=1}^{c} (hn_{1j} - kn_{2j})^2/n_{.j} - (hn_{1.} - kn_{2.})^2/n\right\},$$

and approximately as

$$X^2 = \frac{1}{hk}\left\{\sum_{j=1}^{c} (hn_{1j} - kn_{2j})^2/n_{.j} - (hn_{1.} - kn_{2.})^2/n\right\},$$

with an error of a factor less than $(hn_{1.} - kn_{2.})/n$.

(Haldane, 1955b)
33.22 For the 2 × 11 table,

      25   80   38   52    9   21   33   24   30   51   56  |  419
       1   15   12    8    0    7    6    2    7    7    3  |   68
     ---------------------------------------------------------------
      26   95   50   60    9   28   39   26   37   58   59  |  487

use the results of the last exercise with h = 1, k = 6 to show that $X^2 = 16.709$ by the
exact formula, and 16.393 by the approximate one. This is an approximation error of
1.9 per cent, against the 2.3 per cent maximum allowed.

(Haldane, 1955b)


33.23 Show (cf. Exercise 33.20) that in an r × r contingency table, we may test
complete symmetry (i.e. $H_0: p_{ij} = p_{ji}$, $i, j = 1, 2, \ldots, r$, where $p_{ij}$ is the probability of an
observation occurring in the ith row, jth column) by the statistic

$$\sum_{i<j} \frac{(n_{ij} - n_{ji})^2}{n_{ij} + n_{ji}},$$

asymptotically distributed as $\chi^2$ with $\tfrac{1}{2}r(r-1)$ degrees of freedom.
(Bowker, 1948)


33.24 Show that in an r × r table with identical categorizations (cf. 33.2) we may
test the homogeneity of the two underlying marginal distributions (i.e. $H_0: p_{i.} = p_{.i}$,
$i = 1, 2, \ldots, r$) by the quadratic form in the $(r-1)$ asymptotically normal variables
$d_i = n_{i.} - n_{.i}$, $i = 1, 2, \ldots, r-1$,

$$Q = d'V^{-1}d = \sum_{i=1}^{r-1}\sum_{j=1}^{r-1} V^{ij} d_i d_j,$$

where $V^{ij}$ is the $(i, j)$th element of the inverse of the $(r-1) \times (r-1)$ dispersion matrix $V$
of the $d_i$, whose elements are

$$V_{ii} = n_{i.} + n_{.i} - 2n_{ii}, \qquad V_{ij} = -(n_{ij} + n_{ji}), \quad i \ne j.$$

Show that $Q$ is asymptotically distributed as $\chi^2$ with $(r-1)$ degrees of freedom.

(Stuart, 1955b; Madansky (1963) investigates the LR Test
for marginal homogeneity in a multi-way table)


33.25 Show that in an r × r table, the hypothesis of common proportionality of the
diagonal cells, i.e.
fps = oe 
"pip DI.DS 
may be tested by the statistic 
2 nit 
Nii = - 


M.ni~ UM,Ns 
V= Dn.ns ? 
¢2=1 


4 
Di ni / uN, 1,4 
é i 


asymptotically distributed as $\chi^2$ with (r−1) degrees of freedom.
(J. Durbin; cf. Glass (1954), p. 234) 


33.26 Verify the values of the two sets of components of $X^2$ in (33.128), using the
method given below (33.127).


33.27 Re-arrange the frequencies of Example 33.13, except 3018, in the 2 × 2 × 2
table

    3009  2832 | 3008  2974
    3047  3051 | 2997  3038,

the rows and columns being as laid out and the figures on the right being a "layer"
above those on the left. Obtain the following partitions of $X^2$, the hypothetical
probabilities all being 1/8:

            X² (hypothetical      X² (parameters estimated
            probabilities)        from data)

    R           4.0115                0
    C           1.1503                0
    L           0.2540                0
    RC          2.7357                2.7824
    RL          1.7372                1.7547
    LC          1.3524                1.3607
    RCL         0.4690                0.5107
               -------               ------
               11.7101                6.4085

(Lancaster, 1951) 
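
The first column of the partition in Exercise 33.27 can be checked numerically. The sketch below assumes the hypothetical-probability case only (every cell has probability 1/8), in which each component is the square of a ±1 contrast of the eight frequencies divided by n; the names `counts` and `component` are illustrative, not Lancaster's.

```python
# Cell frequencies keyed by (row, column, layer); layer 0 is the left-hand block.
counts = {
    (0, 0, 0): 3009, (0, 1, 0): 2832,
    (1, 0, 0): 3047, (1, 1, 0): 3051,
    (0, 0, 1): 3008, (0, 1, 1): 2974,
    (1, 0, 1): 2997, (1, 1, 1): 3038,
}
n = sum(counts.values())

def component(effect):
    """X^2 component for an effect such as "R" or "RCL", all cell probabilities being 1/8."""
    contrast = 0
    for (r, c, l), freq in counts.items():
        sign = 1
        # Flip the sign once for every factor in the effect that sits at its second level.
        for letter, index in (("R", r), ("C", c), ("L", l)):
            if letter in effect and index == 1:
                sign = -sign
        contrast += sign * freq
    return contrast ** 2 / n

effects = ["R", "C", "L", "RC", "RL", "LC", "RCL"]
partition = {e: component(e) for e in effects}
total = sum(partition.values())

for e in effects:
    print(f"{e:>3} {partition[e]:.4f}")
print(f"sum {total:.4f}")
```

Because the seven contrasts are mutually orthogonal, their sum should reproduce the ordinary Pearson $X^2$ computed with expected frequencies n/8.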


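33.28 In a 2 × 2 × 2 table, let

$$ T = (p_{1..})^{1/2}, \quad U = (p_{.1.})^{1/2}, \quad V = (p_{..1})^{1/2}, $$
$$ t = (p_{2..})^{1/2}, \quad u = (p_{.2.})^{1/2}, \quad v = (p_{..2})^{1/2}, $$

and the (8 × 8) matrix M be the direct product of the matrices

$$ \begin{pmatrix} V & v \\ -v & V \end{pmatrix}, \quad \begin{pmatrix} U & u \\ -u & U \end{pmatrix}, \quad \begin{pmatrix} T & t \\ -t & T \end{pmatrix}. $$

Let $x_{ijk} = (n_{ijk} - n\,p_{i..}\,p_{.j.}\,p_{..k})/(n\,p_{i..}\,p_{.j.}\,p_{..k})^{1/2}$
and x represent the column vector

$$ x = (x_{111},\ x_{211},\ x_{121},\ x_{221},\ x_{112},\ x_{212},\ x_{122},\ x_{222})'. $$

Show that the elements of Mx are the components of (asymptotically independent) $\chi$
variables for, in this order (cf. (33.126)),

    the total, R, C, RC, L, RL, LC, RCL.
(Lancaster, 1951)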


33.29 The marginal probabilities of an r × c table, namely the $p_{i.}$, $p_{.j}$, are known, and
it is required to estimate the cell probabilities $p_{ij}$. A sample of n observations is taken
from the multinomial distribution with probabilities $p_{ij}$. Show that the ML estimators
$\hat{p}_{ij}$ of the $p_{ij}$ are the solutions of the (r−1)(c−1) equations

$$ \frac{n_{ij}}{\hat{p}_{ij}} - \frac{n_{ic}}{\hat{p}_{ic}} - \frac{n_{rj}}{\hat{p}_{rj}} + \frac{n_{rc}}{\hat{p}_{rc}} = 0. $$

Show also that these are the modified MV unbiassed linear estimators of the $p_{ij}$ obtained
by applying (19.59) to the $n_{ij}$ (i = 1, 2, ..., r−1; j = 1, 2, ..., c−1), their exact
dispersion matrix V having been modified by replacing $p_{ij}$ by $\hat{p}_{ij}$ throughout.
(El-Badry and Stephan (1955); cf. also J. H. Smith (1947))


33.30 On c separate occasions, the same set of n individuals are observed in respect
of the possession (scored 1) or absence (scored 0) of an attribute, and the results put in
the form of a 2 × 2 × ... × 2 = $2^c$ table. Let $T_j$ (j = 1, 2, ..., c) be the total number
of 1's on the jth occasion and $u_i$ (i = 1, 2, ..., $2^c$) be the number of 1's among the c
coordinates of the ith cell in the table. Show that if the probability of a "1" is identical
on all c occasions, the statistic

$$ Q = c(c-1) \sum_{j=1}^{c} (T_j - \bar{T})^2 \Big/ \Big( c \sum_i u_i - \sum_i u_i^2 \Big) $$

(where the summations in the denominator are over all non-empty cells) is asymptotically
distributed as $\chi^2$ with (c−1) degrees of freedom.


(Cochran, 1950; Madansky (1963) generalizes to the $r^c$ table)


33.31 Defining the ratio Rij, = pij,/(pi.. p.j.), and Rix, Rix similarly, show that 
(33.133) may be written 
Hy : Rij, = (dist /p..0)/{(bi./D..%) (Dik /D..4)}; 
with similar expressions for H, in terms of R;,x and R;,x, and show that H; is the hypothesis 
that each of these three ratios is invariant under variation in its suppressed categorized 
variable. 


CHAPTER 34 
SEQUENTIAL METHODS 


Sequential procedures 

34.1 When considering sampling problems in the foregoing chapters we have
usually assumed that the sample number n was fixed. This may be because we chose
it beforehand; or it may be because n was not at our choice, as for example when we
are presented with the results of a finished experiment; or it may be due to the fact
that the sample size was determined by some other criterion, as when we decide to
observe for a given period of time. We make our inferences in domains for which n
is a constant. For example, in setting a standard error to an estimate, we are effectively
making probability statements within a field of samples all of size n. We might,
perhaps, say that our formulae are conditional upon n. If n is determined in some
way which is unrelated to the values of the observations, such a conditional argument
is clearly valid.


34.2 Occasionally, however, the sample number is a random variable dependent 
upon the values of the observations. One of the simplest cases is one we have already 
touched upon in Example 9.13 (Vol. 1, p. 225). Suppose we are sampling human beings 
one by one to discover what proportion belong to a rare blood-group. Instead of sam- 
pling, say, 1000 individuals and counting the number of occurrences of that blood-group 
we may prefer to go on sampling until 20 such members have occurred. We shall see 
later why this may be a preferable procedure; for the moment we take for granted 
that it is worth considering. In successive trials of such an inquiry we should doubtless 
find that for a fixed number of successes, say 20, the number n required to achieve them 
varied considerably. It must be at least 20 but it might be infinite (although the 
probability of going on indefinitely is zero, so that we are almost certain to stop sooner 
or later). 


34.3 Procedures like this are called sequential. Their typical feature is a sampling
scheme, which lays down a rule under which we decide at each stage of the drawing
whether to stop or to continue sampling. In our present example the rule is very
simple: if we draw a failure, continue; if we draw a success, continue also unless
19 successes have previously occurred, in which event, stop. The decision at any point
is, in general, dependent on the observations made up to that point. Thus, for a
sequence of values $x_1, x_2, \ldots, x_n$, the sample number at which we stop is not
independent of the x's. It is this fact which gives sequential analysis its characteristic
features.

Sequential methods were first developed during the Second World War, principally 
by Wald (whose work is summarized in his book, Wald (1947)) in U.S.A., and simul- 
taneously in England by G. A. Barnard (1946). 
34.4 The ordinary case where we fix a sample number beforehand can be regarded
as a very special case of a sequential scheme. The sampling procedure is then: go
on until you have obtained n members, irrespective of what actual values arise. This,
however, is a special case of such a degenerate kind that it really misses the point of the
sequential method.

If the probability is unity that the procedure will terminate, the scheme is said to be 
closed. If there is a non-zero probability that sampling can continue indefinitely the 
scheme is called open. We shall not seriously consider open schemes in this chapter. 
They are obviously of little practical use compared to closed schemes, and we usually 
have to reduce them to closed form by putting an upper limit to the extent of the 
sampling. Such truncation often makes their properties difficult to determine exactly. 

Usage in this matter is not entirely uniform in the literature of the subject.
"Closed" sometimes means "truncated," that is to say, applies to the case where
some definite closure rule puts an upper limit to the amount of sampling.
Correspondingly, "open" sometimes means "non-truncated."


Example 34.1 


As an example of a fairly simple sequential scheme let us consider sampling from a
(large) population with proportion $\varpi$ of successes. We will proceed until m successes
are observed and then stop. It scarcely needs proof that such a scheme is closed.
The probability that in an infinite sequence we do not observe m successes is zero.

The probability of m−1 successes in the first n−1 trials together with a success
at the nth trial is (cf. 5.14-15)

$$ \binom{n-1}{m-1} \varpi^{m} \chi^{n-m}, \qquad n = m, m+1, \ldots, \qquad (34.1) $$

where $\chi = 1-\varpi$. This gives us the distribution of n. The frequency-generating
function of n (with the origin at zero) is given by

$$ \left( \frac{\varpi t}{1-\chi t} \right)^{m}. \qquad (34.2) $$

Thus for the cumulant-generating function we have

$$ \psi(t) = \log \left( \frac{\varpi e^{t}}{1-\chi e^{t}} \right)^{m} = m \log \left( \frac{\varpi e^{t}}{1-\chi e^{t}} \right). $$

Expanding this as far as the coefficient of $t^2$ we find

$$ \kappa_1 = \frac{m}{\varpi}, \qquad (34.3) $$

$$ \kappa_2 = \frac{m\chi}{\varpi^{2}}. \qquad (34.4) $$

Thus the mean value of the sample number n is $m/\varpi$. It does not follow that m/n is
an unbiassed estimator of $\varpi$. Such an unbiassed estimator is, in fact, given by

$$ p = (m-1)/(n-1), \qquad (34.5) $$

for

$$ E(p) = E\left( \frac{m-1}{n-1} \right) = \sum_{n=m}^{\infty} \frac{m-1}{n-1} \binom{n-1}{m-1} \varpi^{m} \chi^{n-m} = \varpi \sum_{n=m}^{\infty} \binom{n-2}{m-2} \varpi^{m-1} \chi^{n-m} = \varpi. \qquad (34.6) $$
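The variance of this estimator is not expressible in a very concise form. We have

$$ E(p^2) = (m-1)\,\varpi^{m} \chi^{1-m} \sum_{n=m}^{\infty} \binom{n-2}{m-2} \frac{\chi^{n-1}}{n-1} $$

$$ = (m-1)\,\varpi^{m} \chi^{1-m} \int_0^{\chi} t^{m-2} (1-t)^{1-m}\, dt. \qquad (34.7) $$

Putting $u = \varpi t/\{\chi(1-t)\}$ we find

$$ E(p^2) = (m-1)\,\varpi^{2} \int_0^1 \frac{u^{m-2}}{\varpi + \chi u}\, du $$

$$ = (m-1)\,\varpi^{2} \int_0^1 u^{m-2} \sum_{j=0}^{\infty} \chi^{j} (1-u)^{j}\, du $$

$$ = (m-1)\,\varpi^{2} \sum_{j=0}^{\infty} \chi^{j} B(m-1, j+1) $$

$$ = \varpi^{2} \left\{ 1 + \frac{\chi}{m} + \frac{2!\,\chi^{2}}{m(m+1)} + \frac{3!\,\chi^{3}}{m(m+1)(m+2)} + \ldots \right\}. \qquad (34.8) $$

Hence, subtracting $\varpi^2$, we have

$$ \operatorname{var} p = \varpi^{2} \left\{ \frac{\chi}{m} + \frac{2\chi^{2}}{m(m+1)} + \frac{6\chi^{3}}{m(m+1)(m+2)} + \ldots \right\}. \qquad (34.9) $$

We can obtain an unbiassed estimator of var p in a simple closed form. In the same
manner that we arrived at (34.6) we have

$$ E\left\{ \frac{(m-1)(m-2)}{(n-1)(n-2)} \right\} = \varpi^{2}, $$

so that

$$ \operatorname{var} p = E\left\{ \left( \frac{m-1}{n-1} \right)^{2} \right\} - \varpi^{2} = E\left\{ \left( \frac{m-1}{n-1} \right)^{2} - \frac{(m-1)(m-2)}{(n-1)(n-2)} \right\}. $$

Thus

$$ \text{Est. var } p = \left( \frac{m-1}{n-1} \right)^{2} - \frac{(m-1)(m-2)}{(n-1)(n-2)} = \frac{(m-1)(n-m)}{(n-1)^{2}(n-2)} = \frac{p(1-p)}{n-2}. \qquad (34.10) $$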

We note that for large n this is asymptotically equal to the corresponding result for
fixed sample size n.

An estimator of the coefficient of variation of p for this negative binomial
distribution is given by

$$ \frac{\{p(1-p)\}^{1/2}}{p\,(n-2)^{1/2}}, $$

and for small p this becomes approximately $(m-1)^{-1/2}$. Thus for the sequential process
the relative sampling variation of p is approximately constant.
This sequential scheme is often called inverse binomial sampling. It was first discussed
by Haldane (1945) and Finney (1949b); W. Knight (1965) unifies its theory for binomial,
Poisson, hypergeometric and exponential distributions.
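
The moment results of Example 34.1 can be checked by summing directly over the distribution (34.1). The sketch below is only a numerical check under assumed values m = 5 and proportion 0.1, with the infinite sum truncated at an arbitrary `n_max`; the function name `inverse_binomial_moments` is illustrative.

```python
from math import comb

def inverse_binomial_moments(m, varpi, n_max=3000):
    """Sum over (34.1), truncated at n_max, to get moments of n and of p = (m-1)/(n-1)."""
    chi = 1.0 - varpi
    out = {"E_n": 0.0, "E_m_over_n": 0.0, "E_p": 0.0, "E_p_sq": 0.0, "E_est_var": 0.0}
    for n in range(m, n_max + 1):
        prob = comb(n - 1, m - 1) * varpi ** m * chi ** (n - m)  # distribution (34.1)
        p = (m - 1) / (n - 1)                                    # estimator (34.5)
        out["E_n"] += n * prob
        out["E_m_over_n"] += (m / n) * prob
        out["E_p"] += p * prob
        out["E_p_sq"] += p * p * prob
        out["E_est_var"] += p * (1.0 - p) / (n - 2) * prob       # estimator (34.10), needs m >= 3
    return out

moments = inverse_binomial_moments(m=5, varpi=0.1)
print("E(n)             ", moments["E_n"])          # compare with m / varpi
print("E(m/n)           ", moments["E_m_over_n"])   # biased upwards
print("E(p)             ", moments["E_p"])          # compare with varpi
print("var p            ", moments["E_p_sq"] - 0.1 ** 2)
print("E(Est. var p)    ", moments["E_est_var"])
```

With these assumed values the mean of n should be close to 50, the mean of p close to 0.1, and the mean of (34.10) close to the true variance of p, while m/n overestimates the proportion on average.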


34.5 The sampling of attributes plays such a large part in sequential analysis 
that we may, before proceeding to more general considerations, discuss a useful dia- 
grammatic method of representing the process. 


[Fig. 34.1: grid of sample paths; abscissa: Failures; ordinate: Successes]


Take a grid such as that of Fig. 34.1 and measure number of failures along the
abscissa, number of successes along the ordinate. The sequential drawing of a sample
may be represented on this grid by a path from the origin, moving one step to the
right for a failure F and one step upwards for a success S. The path OX corresponds,
for example, to the sequence FFSFFFSSFFFFSFS. A stopping rule is equivalent
to some sort of barrier on the diagram. For example, the line AB is such that S + F = 9
and thus corresponds to the case of fixed sample size n = 9. The line CD corresponds
to S = 5 and is thus of the type we considered in Example 34.1 with m = 5. The
path OX, involving a sample of 15, is then one sample which would terminate at X.
If X is the point whose co-ordinates are (x, y) the number of different paths from O to X
is the number of ways in which x can be selected from (x+y). The probability of
arriving at X is this number times the probability of x S's and y F's, namely

$$ \binom{x+y}{x} \varpi^{x} \chi^{y}. $$
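
As a small illustration of the diagram, the sketch below converts the sequence quoted in the text into grid co-ordinates and counts lattice paths. It assumes no barrier lies between O and X when counting all paths; the second count, for paths that first meet the line S = 5 at X, is added here for comparison with the coefficient in (34.1). The helper name `end_point` is hypothetical.

```python
from math import comb

def end_point(sequence):
    """Return (failures, successes), the grid co-ordinates reached by a sequence of F's and S's."""
    return sequence.count("F"), sequence.count("S")

failures, successes = end_point("FFSFFFSSFFFFSFS")
steps = failures + successes

# All lattice paths from O to X, ignoring any barrier on the way.
unrestricted_paths = comb(steps, successes)

# Paths that reach S = 5 for the first time at X must end with a success,
# which matches the coefficient of (34.1) with n = 15, m = 5.
first_passage_paths = comb(steps - 1, successes - 1)

print(failures, successes, unrestricted_paths, first_passage_paths)
```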
Example 34.2. Gambler's Ruin 
One of the oldest problems in the theory of probability concerns a sequential
process. Consider two players, A and B, playing a series of games at each of which A's
chance of success is $\varpi$ and B's is $1-\varpi$. The loser at each game pays the winner one
unit. If A starts with a units and B with b units what are their chances of ruin (a
player being ruined when he has lost his last unit)?

A series of games like this is a sequential set representable on a diagram like Fig. 34.1.
We may take A's winning as a success. The game continues so long as A or B has
any money left but stops when A has a+b (when B has lost all his initial stake) or
when B has a+b (when A has lost his initial stake). The boundaries of the scheme are
therefore the lines y−x = −a and y−x = b.


Fig. 34.2 


Fig. 34.2 shows the situation for the case a = 5, b = 3. The lines AB, CD are at
45° to the axes and go through F = 0, S = 3 and F = 5, S = 0 respectively. For
any point between these lines S−F is less than 3 and F−S is less than 5. On AB,
S−F is 3, and if a path arrives at that line B has lost three more games than A and is
ruined; similarly, if the path arrives at CD, A is ruined. The sequential scheme is,
then: if the point lies between the lines, continue sampling; if it reaches AB, stop
with the ruin of B; if it reaches CD, stop with the ruin of A.

The actual probabilities are easily obtained. Let $u_x$ be the probability that A will
be ruined when he possesses x units. By considering a further game we see that

$$ u_x = \varpi u_{x+1} + \chi u_{x-1}, \qquad (34.11) $$

with boundary conditions

$$ u_0 = 1, \qquad u_{a+b} = 0. \qquad (34.12) $$

The general solution of (34.11) is

$$ u_x = A t_1^{x} + B t_2^{x}, $$

where $t_1$ and $t_2$ are the roots of

$$ \varpi t^{2} - t + \chi = 0, $$

namely t = 1 and $t = \chi/\varpi$.
Provided that $\varpi \neq \chi$, the solution is then found to be, on using (34.12),

$$ u_x = \frac{(\chi/\varpi)^{a+b} - (\chi/\varpi)^{x}}{(\chi/\varpi)^{a+b} - 1}, \qquad \varpi \neq \tfrac{1}{2}. \qquad (34.13) $$

If, however, $\varpi = \tfrac{1}{2}$, the solution is

$$ u_x = \frac{a+b-x}{a+b}. \qquad (34.14) $$

In particular, at the start of the game, for $\varpi = \tfrac{1}{2}$, x = a,

$$ u_a = \frac{b}{a+b}. \qquad (34.15) $$

34.6 We can obviously generalize this kind of situation in many ways and, in 
particular, can set up various types of boundary. A closed scheme is one for which it 
is virtually certain that the boundary will be reached. 

Suppose, in particular, that the scheme specifies that if A loses he pays one unit 
but if B loses he pays k units. The path on Fig. 34.2 representing a series then consists 
of steps of unity parallel to the abscissa and k units parallel to the ordinate. And this 
enables us to emphasize a point which is constantly bedevilling the mathematics of 
sequential schemes : a path may not end exactly on a boundary, but may cross it. For 
example, with k=3 such a path might be OX in Fig. 34.2. After two successes and 
five failures we arrive at P. Another success would take us to X, crossing the boundary 
at M. We stop, of course, at this stage, whether the boundary is reached or crossed. 
The point of the example is that the exact probability of reaching the boundary at M 
is zero—in fact, this point is inaccessible. As we shall see, such discontinuities some- 
times make it difficult to put forward exact and concise statements about the proba- 
bilities of what we are doing. We refer to such situations as ‘‘ end-effects.”” In most 
practical circumstances they can be neglected. 


Sequential tests of hypotheses 

34.7 Let us apply the ideas of sequential analysis to testing hypotheses and, in the
first instance, to choosing between $H_0$ and $H_1$. We suppose that these hypotheses
concern a parameter $\theta$ which may take values $\theta_0$ and $\theta_1$ respectively; i.e. $H_0$ and $H_1$
are simple. We seek a sampling scheme which divides the sample space into three
mutually exclusive domains: (a) domain $\omega_a$, such that if the sample point falls within
it we accept $H_0$ (and reject $H_1$); (b) domain $\omega_r$, such that if the sample point falls
within it we accept $H_1$ (and reject $H_0$); (c) the remainder of the sampling space, $\omega_c$;
if a point falls here we continue sampling. In Example 34.2, taking A's ruin as $H_0$,
B's ruin as $H_1$, the region $\omega_a$ is the region to the right of CD, including the line itself;
$\omega_r$ is the region above AB, including the line itself; $\omega_c$ is the region between the lines.


Operating characteristic 
34.8 The probability of accepting $H_0$ when $H_1$ is true is a function of $\theta_1$ which
we shall denote by $K(\theta_1)$. If the scheme is closed the probability of rejecting $H_0$ when
$H_1$ is true is then $1 - K(\theta_1)$. Considered as a function of $\theta_1$ for different values of $\theta_1$,
this is simply the power function. As in our previous work we could, of course, work
in terms of power; but in sequential analysis it has become customary to work with
$K(\theta_1)$ itself.

$K(\theta)$ considered as a function of $\theta$ is called the "Operating Characteristic" (OC)
of the scheme. Graphed as ordinate against $\theta$ as abscissa it gives us the "OC curve,"
the complement (to unity) of the Power Function.


Average sample number 

34.9 A second function which is used to describe the performance of a sequential
test is the "Average Sample Number" (ASN). This is the mean value of the sample
number n required to reach a decision to accept $H_0$ or $H_1$ and therefore to discontinue
sampling. The OC for $H_0$ and $H_1$ does not depend on the sample number, but only
on constants determined initially by the sampling scheme. The ASN measures the
amount of sampling we have to do to implement that scheme.


Example 34.3 

Consider sampling from a (large) population of attributes of which proportion $\varpi$
are successes, and let $\varpi$ be small. We are interested in the possibility that $\varpi$ is less
than some given value $\varpi_0$. This is, for example, a frequently arising situation where a
manufacturer of some item wishes to guarantee that the proportion of rejects in a batch
of articles is below some declared figure. Consider first of all the alternative $\varpi_1 > \varpi_0$.

We will take a very simple scheme. If no success appears we proceed to sample
until a pre-assigned sample number $n_0$ has appeared and accept $\varpi_0$. If, however, a
success appears we accept $\varpi_1$ and stop sampling.

If the true probability of success is $\varpi$, the probability that we accept the hypothesis
is then $(1-\varpi)^{n_0} = \chi^{n_0}$. This is the OC. It is a J-shaped curve decreasing
monotonically from $\varpi = 0$ to $\varpi = 1$. For two particular values we merely take the ordinates
at $\varpi_0$ and $\varpi_1$.

The common sense of the situation requires that we should accept the smaller of
$\varpi_0$ and $\varpi_1$ if no success appears, and the larger if a success does appear. Let $\varpi_0$ be
the smaller; then the probability of a Type I error $\alpha$ equals $1 - \chi_0^{n_0}$ and that of an error
of Type II, $\beta$, equals $\chi_1^{n_0}$. If we were to interchange $\varpi_0$ and $\varpi_1$, the $\alpha$-error would be
$1 - \chi_1^{n_0}$ and the $\beta$-error $\chi_0^{n_0}$, both of which are greater than in the former case.

We can use the OC in this particular case to provide a test of the composite
hypothesis $H_0 : \varpi \leq \varpi_0$ against $H_1 : \varpi \geq \varpi_1$. In fact, if $\varpi < \varpi_0$ the chance of an
$\alpha$-error is less than $1 - \chi_0^{n_0}$, and if $\varpi > \varpi_1$ the chance of a $\beta$-error is less than $\chi_1^{n_0}$.

The ASN is found by ascertaining the mean value of m, the sample number at
which we terminate. For any given $\varpi$ this is clearly

$$ \sum_{m=1}^{n_0-1} m\,\varpi (1-\varpi)^{m-1} + n_0 (1-\varpi)^{n_0-1} $$

$$ = -\varpi \frac{\partial}{\partial \varpi} \sum_{m=1}^{n_0-1} (1-\varpi)^{m} + n_0 (1-\varpi)^{n_0-1} $$

$$ = \frac{1-(1-\varpi)^{n_0}}{\varpi}. \qquad (34.16) $$

The ASN in this case is also a decreasing function of $\varpi$ since it equals

$$ 1 + \chi + \chi^{2} + \ldots + \chi^{n_0-1}. $$

We observe that the ASN will differ according to whether $\varpi_0$ or $\varpi_1$ is the true value.

A comparison of the results of the sequential procedure with those of an ordinary
fixed sample-size is not easy to make for discontinuous distributions, especially as we
have to compare two kinds of error. Consider, however, $\varpi_0 = 0.1$ and n = 30. From
tables of the binomial (e.g. Biometrika Tables, Table 37) we see that the probability of
5 successes or more is about 0.18. Thus on a fixed sample-size basis we may reject
$\varpi = 0.1$ in a sample of 30 with a Type I error of 0.18. For the alternative $\varpi_1 = 0.2$
the probability of 4 or fewer successes is 0.26, which is then the Type II error.

With the sequential test, for a sample of $n_0$ the Type I error is $1 - \chi_0^{n_0}$ and the Type II
error is $\chi_1^{n_0}$. For a sample of 2 the Type I error is 0.19 and the Type II error 0.64.
For a sample of 6 the errors are 0.47 and 0.26 respectively. We clearly cannot make
both types of errors correspond in this simple case, but it is evident that samples of
smaller size are needed in the sequential case to fix either type of error at a given level.
With more flexible sequential schemes, both types of error can be fixed at given
levels with smaller ASN than the fixed-size sample number. In fact, their economy
in sample number is one of their principal recommendations (cf. Example 34.10).
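
The figures quoted in Example 34.3 can be reproduced with a few lines. The sketch below assumes the scheme exactly as described (stop at the first success, or at the pre-assigned sample number if none appears) and computes the binomial tail directly rather than from tables; all function names are illustrative.

```python
from math import comb

def oc(varpi, n0):
    """Probability of accepting the smaller proportion: no success in n0 trials."""
    return (1.0 - varpi) ** n0

def asn_closed(varpi, n0):
    """Average sample number from (34.16); requires varpi > 0."""
    return (1.0 - (1.0 - varpi) ** n0) / varpi

def asn_direct(varpi, n0):
    """Average sample number from the defining sum that leads to (34.16)."""
    chi = 1.0 - varpi
    early = sum(m * varpi * chi ** (m - 1) for m in range(1, n0))
    return early + n0 * chi ** (n0 - 1)

def binomial_cdf(k, n, p):
    """P(X <= k) for a binomial with index n and proportion p."""
    return sum(comb(n, j) * p ** j * (1.0 - p) ** (n - j) for j in range(k + 1))

varpi0, varpi1 = 0.1, 0.2
for n0 in (2, 6):
    type_one = 1.0 - oc(varpi0, n0)      # reject varpi0 although it is true
    type_two = oc(varpi1, n0)            # accept varpi0 although varpi1 is true
    print(n0, round(type_one, 2), round(type_two, 2), round(asn_closed(varpi0, n0), 3))

fixed_type_one = 1.0 - binomial_cdf(4, 30, varpi0)   # 5 or more successes in 30
fixed_type_two = binomial_cdf(4, 30, varpi1)         # 4 or fewer successes in 30
print(round(fixed_type_one, 2), round(fixed_type_two, 2))
```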


Wald’s probability-ratio test 

34.10 Suppose we take a sample of m values in succession, $x_1, x_2, \ldots, x_m$, from a
population $f(x, \theta)$. At any stage the ratio of the probabilities of the sample on
hypotheses $H_0$ ($\theta = \theta_0$) and $H_1$ ($\theta = \theta_1$) is

$$ L_m = \prod_{i=1}^{m} f(x_i, \theta_1) \Big/ \prod_{i=1}^{m} f(x_i, \theta_0). \qquad (34.17) $$

We select two numbers A and B, related to the desired $\alpha$- and $\beta$-errors in a manner
to be described later, and set up a sequential test as follows: so long as $B < L_m < A$
we continue sampling; at the first occasion when $L_m \geq A$ we accept $H_1$; at the first
occasion when $L_m \leq B$ we accept $H_0$.

An equivalent but more convenient form for computation is the logarithm of $L_m$,
the critical inequality then being

$$ \log B < \sum_{i=1}^{m} \log f(x_i, \theta_1) - \sum_{i=1}^{m} \log f(x_i, \theta_0) < \log A. \qquad (34.18) $$

This family of tests we shall refer to as "sequential probability-ratio tests" (SPR
tests).


34.11 We shall often find it convenient to write

$$ z_i = \log \{ f(x_i, \theta_1) / f(x_i, \theta_0) \}, \qquad (34.19) $$

and the critical inequality (34.18) is then equivalent to a statement concerning the
cumulative sums of $z_i$'s. Let us first of all prove that a SPR test terminates with
probability unity, i.e. is closed.

The sampling terminates if either

$$ \sum z_i \geq \log A $$

or

$$ \sum z_i \leq \log B. $$

The $z_i$'s are independent random variables with variance, say $\sigma^2 > 0$. $\sum_{i=1}^{m} z_i$ then
has a variance $m\sigma^2$. As m increases, the dispersion becomes greater and the probability
that a value of $\sum z_i$ remains within the finite limits log B and log A tends to zero. More
precisely, the mean $\bar{z}$ tends under the central limit effect to a (normal) distribution with
variance $\sigma^2/m$, and hence the probability that it falls between (log B)/m and (log A)/m
tends to zero.

It was shown by Stein (1946) that $E(e^{nt})$ exists for any complex number t whose real
part is less than some $t_0 > 0$. It follows that the random variable n has moments of
all orders.


Example 34.4 
Consider again the binomial distribution, the probability of success being $\varpi$. If
there are k successes in the first m trials the SPR criterion is given by

$$ \log L_m = k \log \frac{\varpi_1}{\varpi_0} + (m-k) \log \frac{1-\varpi_1}{1-\varpi_0}. \qquad (34.20) $$

This quantity is computed as we go along, the sampling continuing until we reach the
boundary values log B or log A. How we decide upon A and B will appear in a
moment.


34.12 It is a remarkable fact that the numbers A and B can be derived very simply
(at least to an acceptable degree of approximation) from the probabilities of errors of
the first and second kinds, $\alpha$ and $\beta$, without knowledge of the parent population. There
are thus no distributional problems to be solved. This does not mean that the
sequential process is distribution-free. All that is happening is that our knowledge
of the frequency distribution is put into the criterion $L_m$ of (34.17) and we work with
this ratio of likelihoods directly. It will not, then, come as a surprise to find that
SPR tests have certain optimum properties; for they use all the available information,
including the order in which the sample values occur.

Consider a sample for which $L_m$ lies between A and B for the first n−1 trials and
then becomes $\geq A$ at the nth trial so that we accept $H_1$ (and reject $H_0$). By definition,
the probability of getting such a sample is at least A times as large under $H_1$ as under
$H_0$. This, being true for any one sample, is true for all and for the aggregate of all
possible samples resulting in the acceptance of $H_1$. The probability of accepting $H_1$
when $H_0$ is true is $\alpha$, and that of accepting $H_1$ when $H_1$ is true is $1-\beta$. Hence

$$ 1-\beta \geq A\alpha $$

or

$$ A \leq \frac{1-\beta}{\alpha}. \qquad (34.21) $$

In like manner we see from the cases in which we accept $H_0$ that

$$ \beta \leq B(1-\alpha), $$

or

$$ B \geq \frac{\beta}{1-\alpha}. \qquad (34.22) $$
34.13 If our boundaries were such that A and B were exactly attained when
attained at all, i.e. if there were no end-effects, we could write

$$ A = \frac{1-\beta}{\alpha}, \qquad B = \frac{\beta}{1-\alpha}. \qquad (34.23) $$

In point of fact, Wald (1947) showed that for all practical purposes these equalities
could be assumed to hold. Suppose that we have exactly

$$ a = \frac{1-\beta}{\alpha}, \qquad b = \frac{\beta}{1-\alpha}, \qquad (34.24) $$

and that the true errors of first and second kind for the limits a and b are $\alpha'$, $\beta'$. We
then have, from (34.21),

$$ \frac{\alpha'}{1-\beta'} \leq \frac{1}{a} = \frac{\alpha}{1-\beta}, \qquad (34.25) $$

and from (34.22)

$$ \frac{\beta'}{1-\alpha'} \leq b = \frac{\beta}{1-\alpha}. \qquad (34.26) $$

Hence

$$ \alpha' \leq \frac{\alpha(1-\beta')}{1-\beta} \leq \frac{\alpha}{1-\beta}, \qquad (34.27) $$

$$ \beta' \leq \frac{\beta(1-\alpha')}{1-\alpha} \leq \frac{\beta}{1-\alpha}. \qquad (34.28) $$

Furthermore,

$$ \alpha'(1-\beta) + \beta'(1-\alpha) \leq \alpha(1-\beta') + \beta(1-\alpha') $$

or

$$ \alpha' + \beta' \leq \alpha + \beta. \qquad (34.29) $$

Now in practice $\alpha$ and $\beta$ are small, often conventionally 0.01 or 0.05. It follows from
(34.27) and (34.28) that the amount by which $\alpha'$ can exceed $\alpha$, or $\beta'$ exceed $\beta$, is
negligible. Moreover, from (34.29) we see that either $\alpha' \leq \alpha$ or $\beta' \leq \beta$. Hence, by
using a and b in place of A and B, the worst we can do is to increase one of the errors,
and then only by a very small amount. Such a procedure, then, will always be on the
safe side in the sense that for all practical purposes it will not increase the errors of
wrong decision. To avoid tedious repetition we shall henceforward use the equalities
(34.23) except where the contrary is specified.


Example 34.5 
Consider again the binomial of Example 34.4 with $\alpha = 0.01$, $\beta = 0.10$, $\varpi_0 = 0.01$
and $\varpi_1 = 0.03$. We have, for k successes and n−k failures (taking logarithms to
base 10),

$$ \log \frac{\beta}{1-\alpha} < (n-k) \log \frac{1-\varpi_1}{1-\varpi_0} + k \log \frac{\varpi_1}{\varpi_0} < \log \frac{1-\beta}{\alpha} $$

or

$$ \log \tfrac{10}{99} < (n-k) \log \tfrac{97}{99} + k \log 3 < \log 90 $$

or

$$ -0.995635 < -0.0088635\,(n-k) + (0.477121)\,k < 1.954243. $$

Dividing through by 0.0088635 we find, to the nearest integer,

$$ -112 < 54k - n < 220. \qquad (34.30) $$

For a test of this kind, for example, if no success occurred in the first 112 drawings
we should accept $\varpi_0$. If one occurred at the 100th drawing and another at the 200th,
we could not accept before the 220th (i.e., 112 + (2 × 54)) drawing. And if, by the 200th
drawing, there had occurred 6 successes, say at the 50th, 100th, 125th, 150th, 175th,
200th, we could not reject, 54k − n being 124 at the 200th drawing; but if that
experience was then repeated, the quantity 54k − n would exceed 220 and we should accept $\varpi_1$.
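
The constants of (34.30) and the worked sequence of Example 34.5 can be traced with a short sketch. It assumes the approximate limits (34.23) and the integer form of the rule, with k counting the occurrences at the stated drawings; `integer_rule` and `run_test` are hypothetical helper names, and `max_draws` is an arbitrary safety limit.

```python
from math import log10

def integer_rule(varpi0, varpi1, alpha, beta):
    """Scale the log-likelihood inequality so that one non-occurrence costs one unit."""
    A = (1.0 - beta) / alpha
    B = beta / (1.0 - alpha)
    unit = -log10((1.0 - varpi1) / (1.0 - varpi0))
    lower = round(log10(B) / unit)
    upper = round(log10(A) / unit)
    per_occurrence = log10(varpi1 / varpi0) / unit
    return lower, upper, per_occurrence

def run_test(occurrence_draws, lower=-112, upper=220, coeff=54, max_draws=1000):
    """Apply -112 < 54k - n < 220 draw by draw; return (draw number, decision)."""
    occurrences = set(occurrence_draws)
    k = 0
    for n in range(1, max_draws + 1):
        if n in occurrences:
            k += 1
        score = coeff * k - n
        if score <= lower:
            return n, "accept varpi_0"
        if score >= upper:
            return n, "accept varpi_1"
    return max_draws, "undecided"

lower, upper, per_occurrence = integer_rule(0.01, 0.03, 0.01, 0.10)
print(lower, upper, round(per_occurrence, 2))

first_six = [50, 100, 125, 150, 175, 200]
repeated = first_six + [d + 200 for d in first_six]
print(run_test([]))          # no occurrence at all
print(run_test(repeated))    # the experience of the first 200 drawings, repeated
```

With no occurrence the rule should stop at the 112th drawing, and with the repeated experience it should stop in favour of the larger proportion at the 400th drawing, where 54k − n = 248.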


The OC of the SPR test 
34.14 Consider the function

$$ L^{h} = \left\{ \frac{f(x, \theta_1)}{f(x, \theta_0)} \right\}^{h}, \qquad (34.31) $$

where h is a function of $\theta$. $L^{h} f(x, \theta)$, say $g(x, \theta)$, is a frequency function for any value
of $\theta$ provided that

$$ E(L^{h}) = \int \left\{ \frac{f(x, \theta_1)}{f(x, \theta_0)} \right\}^{h} f(x, \theta)\, dx = 1. \qquad (34.32) $$

It may be shown (cf. Exercise 34.4) that there is at most one non-zero value of h
satisfying this equation. Consider the rule: accept $H_0$, continue sampling, or accept $H_1$
according to the inequality

$$ B^{h} < \frac{\prod \{ L^{h} f(x, \theta) \}}{\prod \{ f(x, \theta) \}} < A^{h}. \qquad (34.33) $$

This is evidently equivalent to the ordinary rule of (34.18) provided that h > 0. Consider
testing H: that the true distribution is $f(x, \theta)$, against G: that the true distribution is
$g(x, \theta)$. If $\alpha'$, $\beta'$ are the two errors, the likelihood ratio is the one appearing in (34.33),
and we then have

$$ A^{h} = \frac{1-\beta'}{\alpha'}, \qquad B^{h} = \frac{\beta'}{1-\alpha'}, \qquad (34.34) $$

and hence

$$ \alpha' = \frac{1 - B^{h}}{A^{h} - B^{h}}, $$

and since $\alpha'$ is the power function when H holds, its complement, the OC, is given by

$$ 1 - \alpha' = K(\theta) = \frac{A^{h} - 1}{A^{h} - B^{h}}. \qquad (34.35) $$

The same formula holds if h < 0.

We can now find the OC of the test. When $h(\theta) = 1$ we have the performance
at $\theta = \theta_0$. When $h(\theta) = -1$ we have the performance at $\theta = \theta_1$. For other values
we have, in effect, to solve (34.32) for $\theta$ and then substitute in (34.35). But this is, in
fact, not necessary in order to plot the OC curve of $K(\theta)$ against $\theta$. We can take
$h(\theta)$ itself as a parameter and plot (34.35) against it.


Example 34.6 
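Consider once again the binomial of previous Examples. We may write for the
discrete values 1 (success) and 0 (failure)

$$ f(1, \varpi) = \varpi, \qquad f(0, \varpi) = 1 - \varpi. $$

Then (34.32) becomes

$$ \varpi \left( \frac{\varpi_1}{\varpi_0} \right)^{h} + (1-\varpi) \left( \frac{1-\varpi_1}{1-\varpi_0} \right)^{h} = 1, $$

or

$$ \varpi = \frac{1 - \left( \dfrac{1-\varpi_1}{1-\varpi_0} \right)^{h}}{\left( \dfrac{\varpi_1}{\varpi_0} \right)^{h} - \left( \dfrac{1-\varpi_1}{1-\varpi_0} \right)^{h}}. \qquad (34.36) $$

For $A = (1-\beta)/\alpha$, $B = \beta/(1-\alpha)$ we then have from (34.35)

$$ K(\varpi) = \frac{\left( \dfrac{1-\beta}{\alpha} \right)^{h} - 1}{\left( \dfrac{1-\beta}{\alpha} \right)^{h} - \left( \dfrac{\beta}{1-\alpha} \right)^{h}}. \qquad (34.37) $$

We can now plot $K(\varpi)$ against $\varpi$ by using (34.36) and (34.37) as parametric equations
in h.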


The ASN of the SPR test 
34.15 Consider a sequence of n random variables $z_i$. If n were a fixed number
we should have

$$ E\left( \sum_{i=1}^{n} z_i \right) = n E(z). $$

This is not true for sequential sampling, but we have instead the result

$$ E\left( \sum_{i=1}^{n} z_i \right) = E(n) E(z), \qquad (34.38) $$

which is not quite as obvious as it looks. The result is due to Wald and to Blackwell
(1946), the following proof being due to Johnson (1959b).

Let each $z_i$ have mean value $\mu$, $E|z_i| \leq C < \infty$, and let the probability that n
takes the value k be $P_k$ (k = 1, 2, ...). Consider the "marker" variable $y_i$ which
is unity if $z_i$ is observed (i.e. if $n \geq i$) and zero in the opposite case. Then

$$ P(y_i = 1) = P(n \geq i) = \sum_{j=i}^{\infty} P_j. \qquad (34.39) $$

Now let $Z_n = \sum_{i=1}^{n} z_i$. Then

$$ Z_n = \sum_{i=1}^{\infty} y_i z_i, $$

$$ E(Z_n) = E \sum_{i=1}^{\infty} (y_i z_i) = \sum_{i=1}^{\infty} E(y_i z_i), \qquad (34.40) $$

which will be finite if E(n) is finite. Furthermore, since $y_i$ depends only on $z_1, z_2, \ldots,$
$z_{i-1}$ and not on $z_i$, we have

$$ E(y_i z_i) = E(y_i) E(z_i). \qquad (34.41) $$

Hence

$$ E(Z_n) = \sum_{i=1}^{\infty} E(y_i) E(z_i) = \mu \sum_{i=1}^{\infty} E(y_i) $$

$$ = \mu \sum_{i=1}^{\infty} \{ P_i + P_{i+1} + \ldots \} $$

$$ = \mu \sum_{i=1}^{\infty} i P_i $$

$$ = \mu E(n), $$

whence (34.38) follows.

We then have

$$ E(n) = \frac{E(Z_n)}{E(z)}. \qquad (34.42) $$

But, to our usual approximation, $Z_n$ can take only two values for the sampling to
terminate, log A with probability $1 - K(\theta)$ and log B with probability $K(\theta)$. Thus

$$ E(n) = \frac{K \log B + (1-K) \log A}{E(z)}, \qquad (34.43) $$

which is the approximate formula for the average sample number.
Example 34.7 
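For the binomial we find

$$ E(z) = E \log \left\{ \frac{f(x, \varpi_1)}{f(x, \varpi_0)} \right\} = \varpi \log \frac{\varpi_1}{\varpi_0} + (1-\varpi) \log \frac{1-\varpi_1}{1-\varpi_0}. \qquad (34.44) $$

The ASN can then be calculated from (34.43) when $\varpi_0$, $\varpi_1$, A and B (or $\alpha$ and $\beta$) are
given. It is, of course, a function of $\varpi$.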
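
The parametric form of the OC in (34.36) and (34.37), together with the approximate ASN of (34.43) and (34.44), can be sketched as follows. The values of the two proportions and the two errors are those of Example 34.5; the function names are illustrative, and the formulas are the end-effect-free approximations, so the output is indicative only.

```python
from math import log

def wald_limits(alpha, beta):
    """Approximate limits A and B of (34.23)."""
    return (1.0 - beta) / alpha, beta / (1.0 - alpha)

def oc_point(h, varpi0, varpi1, alpha, beta):
    """Return (varpi, K) for a non-zero value of the parameter h, from (34.36) and (34.37)."""
    if h == 0:
        raise ValueError("h = 0 gives an indeterminate form; use a small non-zero h")
    A, B = wald_limits(alpha, beta)
    r1 = (varpi1 / varpi0) ** h
    r0 = ((1.0 - varpi1) / (1.0 - varpi0)) ** h
    varpi = (1.0 - r0) / (r1 - r0)
    K = (A ** h - 1.0) / (A ** h - B ** h)
    return varpi, K

def asn(varpi, K, varpi0, varpi1, alpha, beta):
    """Approximate average sample number from (34.43) with E(z) from (34.44)."""
    A, B = wald_limits(alpha, beta)
    mean_z = varpi * log(varpi1 / varpi0) + (1.0 - varpi) * log((1.0 - varpi1) / (1.0 - varpi0))
    return (K * log(B) + (1.0 - K) * log(A)) / mean_z

args = (0.01, 0.03, 0.01, 0.10)   # varpi0, varpi1, alpha, beta
for h in (2.0, 1.0, 0.5, -0.5, -1.0, -2.0):
    varpi, K = oc_point(h, *args)
    print(f"h={h:5.1f}  varpi={varpi:.5f}  K={K:.4f}  ASN={asn(varpi, K, *args):7.1f}")
```

At h = 1 and h = −1 the sketch should return the two hypothetical proportions, with OC values $1-\alpha$ and $\beta$ respectively; near the value of the proportion for which E(z) vanishes (h close to zero) the ASN formula becomes indeterminate and is not evaluated here.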
34.16 For practical application, sequential testing for attributes is often expressed
in such a way that the calculations are in terms of integers. Equation (34.30) is a
case in point. We may rewrite it as

$$ 332 > 220 + (n-k) - 53k > 0. $$

We may imagine a game in which we start with a score of 220. If a failure occurs
we add one to the score; if a success occurs we lose 53 units. The game stops as
soon as the score falls to zero or rises to 332, corresponding to acceptance of the
values $\varpi_1$ and $\varpi_0$ respectively.


34.17 On such a scheme, suppose that we start with a score $S_2$. For every failure
we gain one unit, but for every success we lose $b$ units. If the score rises by $S_1$ so
as to reach $S_1+S_2$ ($= 2S$, say) we accept one hypothesis; if it falls to zero we accept
the other. Let the score at any point be $x$ and the probability be $u_x$ that it will ultimately
reach $2S$ without in the meantime falling to zero. Consider the outcome of the
next trial. A failure increases the score by unity to $x+1$, a success diminishes it by
$b$ to $x-b$. Thus

$$ u_x = (1-\varpi)u_{x+1}+\varpi u_{x-b}, \tag{34.45} $$

with initial conditions
$$ u_0 = u_{-1} = \dots = u_{-b+1} = 0, \tag{34.46} $$
$$ u_{2S} = 1. \tag{34.47} $$

For $b = 1$ this equation is easy to solve, as in Example 34.2. For $b > 1$ (and we shall
suppose it integral) the solution is more cumbrous. We quote without proof the
solution obtained by Burman (1946)

$$ u_x = \frac{F(x)}{F(2S)}, \tag{34.48} $$
where
$$ F(x) = \begin{cases}
(1-\varpi)^{-x}\left[1-\binom{x-b-1}{1}\varpi(1-\varpi)^b+\binom{x-2b-1}{2}\{\varpi(1-\varpi)^b\}^2-\dots\right], & x \geq 1, \\
0, & x \leq 0.
\end{cases} \tag{34.49} $$

Here the series continues as long as $x-kb-1$ is positive. Burman also gave expressions
for the ASN and the variance of the sample number.
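
A series solution of this kind can be checked against the difference equation directly. The Python sketch below assumes the series form F(x) = (1 - p)^(-x) * sum over k of (-1)^k C(x - kb - 1, k) {p (1 - p)^b}^k, taken while x - kb - 1 is positive (with the k = 0 term always present), and verifies in exact rational arithmetic that u_x = F(x)/F(2S) satisfies (34.45). The function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def burman_f(x, b, p):
    """Series F(x), assumed form: zero for x <= 0, alternating series otherwise."""
    if x <= 0:
        return Fraction(0)
    q = 1 - p
    w = p * q**b
    total = Fraction(0)
    k = 0
    # The k = 0 term is always present; later terms need x - k*b - 1 > 0.
    while k == 0 or x - k * b - 1 > 0:
        total += (-1) ** k * comb(x - k * b - 1, k) * w**k
        k += 1
    return total / q**x

def reach_probability(x, b, total_score, p):
    """u_x: probability of reaching total_score before falling to zero."""
    return burman_f(x, b, p) / burman_f(total_score, b, p)

p, b, total_score = Fraction(1, 4), 2, 10
for x in range(1, total_score):
    lhs = reach_probability(x, b, total_score, p)
    rhs = ((1 - p) * reach_probability(x + 1, b, total_score, p)
           + p * reach_probability(x - b, b, total_score, p))
    assert lhs == rhs          # the recurrence (34.45) holds exactly
```

Exact fractions are used because the alternating series loses accuracy quickly in floating point when x is large.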


34.18 Anscombe (1949a) tabulated functions of this kind. Putting
$$ R_1 = \frac{S_1}{b+1}, \qquad R_2 = \frac{S_2}{b+1}, \tag{34.50} $$
Anscombe tabulates $R_1$, $R_2$ for certain values of the errors $\alpha$, $\beta$ (actually $1-\alpha$ and $\beta$)
and the ratio $S_1/S_2$, the values for $\varpi(b+1)$ being also provided.
Given $\varpi_0$, $\varpi_1$, $\alpha$, $\beta$ we can find $R_1$ and $R_2$. There remains an element of choice
according to how we fix the ratio $S_1/S_2$.

Thus, for $\varpi_0 = 0.01$, $\varpi_1 = 0.03$, $S_2 = 2S_1$, $\alpha = 0.01$, $\beta = 0.10$ we find $R_2 = 4$,
$R_1 = 2$ approximately. Also $\varpi_0(b+1) = 0.571$ or $b = 56$. We then find, from (34.50),
$S_1 = 114$, $S_2 = 228$. The agreement with Example 34.5 ($S_1 = 112$, $S_2 = 220$,
$b = 53$) is very fair. The ASN for $\varpi = 0.01$ is 253 and that for $\varpi = 0.03$ is 306.


34.19 It is instructive to consider what happens in the limit when the units 1 and $b$
are small compared to the total score $2S$. We can imagine this, on the diagram of
Fig. 34.1, as a shrinkage of the mesh so that the routes approach a continuous random
path of a particle subject to infinitesimal disturbances in two perpendicular directions.
From this viewpoint the subject links up with the theory of Brownian motion and
diffusion. If $\Delta$ is the difference operator defined by
$$ \Delta u_x = u_{x+1}-u_x $$
we may write equation (34.45) in the form
$$ \{(1-\varpi)(1+\Delta)+\varpi(1+\Delta)^{-b}-1\}u_x = 0. \tag{34.51} $$
For small $\Delta$ this is nearly equivalent to
$$ \{(1-\varpi)+\varpi-1+(1-\varpi)\Delta-b\varpi\Delta+\tfrac12 b(b+1)\varpi\Delta^2\}u_x = 0, $$
namely, to
$$ \{(1-\varpi-b\varpi)\Delta+\tfrac12 b(b+1)\varpi\Delta^2\}u_x = 0. \tag{34.52} $$
In the limit this becomes
$$ \frac{d^2u}{dx^2}+\lambda\frac{du}{dx} = 0, \tag{34.53} $$
where
$$ \lambda = \frac{2(1-\varpi-b\varpi)}{\varpi b(b+1)}. \tag{34.54} $$

The general solution of (34.53) is
$$ u_x = k_1+k_2e^{-\lambda x}, $$
and since the boundary conditions are
$$ u_{2S} = 1, \qquad u_0 = 0, $$
we have
$$ u_x = \frac{1-\exp(-\lambda x)}{1-\exp\{-\lambda(S_1+S_2)\}}. \tag{34.55} $$
Thus for $x = S_2$ the probability of acceptance is
$$ u_{S_2} = \frac{\exp(\lambda S_2)-1}{\exp(\lambda S_2)-\exp(-\lambda S_1)}. \tag{34.56} $$
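
The diffusion approximation is simple to evaluate. The Python sketch below computes lambda = 2(1 - p - bp)/{p b (b + 1)} and the probability (1 - exp(-lambda S2))/(1 - exp(-lambda (S1 + S2))) of reaching the upper score from a start at S2, which is algebraically the same ratio as the exponential form given for the acceptance probability. The function name is hypothetical, and the limiting value S2/(S1 + S2) is used when lambda vanishes.

```python
import math

def diffusion_acceptance(p, b, s1, s2):
    """Diffusion approximation to the probability of reaching S1 + S2 from S2."""
    lam = 2 * (1 - p - b * p) / (p * b * (b + 1))
    if abs(lam) < 1e-12:
        # Limit of the ratio as lambda tends to zero.
        return s2 / (s1 + s2)
    # expm1 keeps accuracy when lambda * score is small.
    return math.expm1(-lam * s2) / math.expm1(-lam * (s1 + s2))

print(diffusion_acceptance(0.01, 56, 114, 228))
```

For a proportion of 0.01 with b = 56, S1 = 114 and S2 = 228 the approximation gives a probability a little below unity.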


34.20 As before, write
$$ R_1(b+1) = S_1, \qquad R_2(b+1) = S_2, $$
and let $\varpi$ tend to zero so that $\varpi(b+1) = \gamma$, say, remains finite. From (34.54) we
see that $\lambda$ tends to zero, but
$$ \lambda b = \frac{2\{1-\varpi(b+1)\}}{\varpi(b+1)} = \frac{2(1-\gamma)}{\gamma} = \delta, \text{ say.} \tag{34.57} $$
Then $\lambda S_1$ tends to $S_1\delta/b$, i.e. to $R_1\delta$, provided that $\gamma \neq 1$, and (34.56) becomes
$$ u_{S_2} = \frac{\exp(R_2\delta)-1}{\exp(R_2\delta)-\exp(-R_1\delta)}. \tag{34.58} $$
If $\delta$ tends to zero we find
$$ u_{S_2} = \frac{R_2}{R_1+R_2}. \tag{34.59} $$

(34.58) may be compared to (34.37) which, for small 4, can be written 


exp (<i) —1 


= exp (Ph) —exp (-5**) , 


34.21 The use of sequential methods in the control of the quality of manufactured
products has led to considerable developments of the kind of results mentioned in 
34.17 and 34.18. We shall not have the space here to discuss the subject in detail 
and the reader who is interested is referred to some of the textbooks on quality control. 
We will merely mention some of the extensions of the foregoing theory by way of 
illustrating the scope of the subject. 

(a) Stopping rules. Even for a closed scheme it may be desirable to call a halt in 
the sampling at some stage. For instance, circumstances may prevent sampling beyond 
a certain point of time; or, in clinical trials, medical etiquette may require a change of 
treatment to a new drug which looks promising even before its value is fully established. 
Sequential schemes may be truncated in various ways, the simplest being to require 
stopping either after a given sample size has been reached or when a given time has 
elapsed. In such cases our general notions about performance characteristics and 
average sample numbers remain unchanged, but the actual mathematics and arithmetic 
are usually far more troublesome. Armitage (1957) considers sequential sampling 
under various restrictions. 

(b) Rectifying inspection. In the schemes we have considered the hypotheses were 
that the batch or population under inquiry should be accepted or rejected as having a 
specified proportion of an attribute. If the attribute is "defectiveness" we may prefer
not to reject a batch in toto but to inspect every member of it and to replace the defective
ones. This does not of itself affect the general character of the scheme (the decision
to reject is replaced by a decision to rectify) but it does, of course, affect the proportion
of rejects in the whole aggregate of batches to which the sampling plan is applied,
what is known as the average outgoing quality level (AOQL); and hence it affects the
values of the parameters which we put into the plan. The theory was examined by
Bartky (1943). 

(c) Double sampling. As an extension of this idea, we may find it economical to 
proceed in stages. For example, we may decide to have four possible decisions: to 
accept outright ; to reject outright ; to continue sampling ; to suspend judgment but 
to inspect fully. There is evidently a wide variety of possible choice here. An excellent 
example is the double sampling procedure of Dodge and Romig (1944). We shall 
encounter the idea again later in the chapter (34.36). 


(34.60) 


us 
Example 34.8

Consider the testing, for the mean of a normal distribution with unit variance,
of $H_0$ ($\mu = \mu_0$) against $H_1$ ($\mu = \mu_1$). With $z$ defined as at (34.19) we have
$$ z_i = -\tfrac12(x_i-\mu_1)^2+\tfrac12(x_i-\mu_0)^2 = (\mu_1-\mu_0)x_i-\tfrac12(\mu_1^2-\mu_0^2), $$
$$ Z_m = \sum_{i=1}^{m} z_i = m(\mu_1-\mu_0)\bar{x}-\tfrac12 m(\mu_1^2-\mu_0^2). \tag{34.61} $$
We accept $H_0$ or $H_1$ according as this quantity $\leq \log B$ or $\geq \log A$. For the appropriate
OC curve we have, from (34.35),
$$ K(\mu) = \frac{A^h-1}{A^h-B^h}, \tag{34.62} $$
where $h$ is given by
$$ \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\exp[h\{(\mu_1-\mu_0)x-\tfrac12(\mu_1^2-\mu_0^2)\}]\exp\{-\tfrac12(x-\mu)^2\}\,dx = 1, \tag{34.63} $$
which is easily seen to be equivalent to
$$ \exp[\tfrac12\{(\mu+h\mu_1-h\mu_0)^2-\mu^2-h\mu_1^2+h\mu_0^2\}] = 1 $$
or to
$$ h = \frac{\mu_1+\mu_0-2\mu}{\mu_1-\mu_0}, \qquad \mu_1 \neq \mu_0, \quad h \neq 0. \tag{34.64} $$
We can then draw the OC curve for a range of values of $\mu$ by calculating $h$ from (34.64)
and substituting in (34.62).

Likewise for the ASN we have
$$ E(z) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\{(\mu_1-\mu_0)x-\tfrac12(\mu_1^2-\mu_0^2)\}\exp\{-\tfrac12(x-\mu)^2\}\,dx
        = (\mu_1-\mu_0)\mu-\tfrac12(\mu_1^2-\mu_0^2). \tag{34.65} $$
Again, for a range of $\mu$ the ASN can be determined from this equation in conjunction
with (34.62) and
$$ E(n) = \frac{K\log B+(1-K)\log A}{E(z)}. \tag{34.66} $$
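
The OC and ASN calculations of this example can be sketched in Python as follows. The function name is hypothetical; the boundaries A = (1 - beta)/alpha and B = beta/(1 - alpha) are the usual approximations, assumed here; and the branch for h = 0, where E(z) vanishes and the ratio formulae are indeterminate, uses the limiting values K = log A/(log A - log B) and E(n) = -log A log B / var(z), which is an added assumption of the sketch.

```python
import math

def oc_and_asn(mu, mu0, mu1, alpha, beta):
    """Approximate OC value K(mu) and ASN for the unit-variance normal mean test."""
    # Assumed approximate boundaries.
    log_a = math.log((1 - beta) / alpha)
    log_b = math.log(beta / (1 - alpha))
    h = (mu1 + mu0 - 2 * mu) / (mu1 - mu0)                # (34.64)
    mean_z = (mu1 - mu0) * mu - 0.5 * (mu1**2 - mu0**2)   # (34.65)
    if abs(h) < 1e-12:
        # Limiting case: E(z) = 0 and var(z) = (mu1 - mu0)^2.
        k = log_a / (log_a - log_b)
        return k, -log_a * log_b / (mu1 - mu0) ** 2
    a_h = math.exp(h * log_a)
    b_h = math.exp(h * log_b)
    k = (a_h - 1) / (a_h - b_h)                           # (34.62)
    asn = (k * log_b + (1 - k) * log_a) / mean_z          # (34.66)
    return k, asn

for mu in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(mu, oc_and_asn(mu, 0.0, 1.0, 0.01, 0.03))
```

At mu = mu0 the OC value reduces to 1 - alpha and at mu = mu1 to beta, which is a convenient check on the boundaries.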


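Example 34.9

Suppose that the mean of a normal distribution is known to be $\mu$. To test a
hypothesis concerning its variance, $H_0: \sigma^2 = \sigma_0^2$ against $H_1: \sigma^2 = \sigma_1^2$, we have
$$ Z_m = \sum z_i = -m\log\frac{\sigma_1}{\sigma_0}-\left(\frac{1}{2\sigma_1^2}-\frac{1}{2\sigma_0^2}\right)\sum(x-\mu)^2. \tag{34.67} $$
This lies between the limits $\log\{\beta/(1-\alpha)\}$ and $\log\{(1-\beta)/\alpha\}$ if
$$ \log\frac{\beta}{1-\alpha} < -m\log\frac{\sigma_1}{\sigma_0}-\tfrac12\left(\frac{1}{\sigma_1^2}-\frac{1}{\sigma_0^2}\right)\sum(x-\mu)^2 < \log\frac{1-\beta}{\alpha}. \tag{34.68} $$
With some rearrangement we find that, for $\sigma_1 > \sigma_0$, this is equivalent to
$$ \frac{2\log\dfrac{\beta}{1-\alpha}+m\log\dfrac{\sigma_1^2}{\sigma_0^2}}{\dfrac{1}{\sigma_0^2}-\dfrac{1}{\sigma_1^2}} < \sum(x-\mu)^2 < \frac{2\log\dfrac{1-\beta}{\alpha}+m\log\dfrac{\sigma_1^2}{\sigma_0^2}}{\dfrac{1}{\sigma_0^2}-\dfrac{1}{\sigma_1^2}}. \tag{34.69} $$

The OC and ASN are given in Exercises 34.18 and 34.19.

If the mean is not known the test remains the same except that the test statistic
$\sum(x-\mu)^2$ is replaced by $\sum(x-\bar{x})^2$ and the value $m$ in the inequality (34.69) is replaced
by $(m-1)$.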
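
The equivalence between the limits on the log likelihood ratio and the limits on the sum of squares can be checked numerically. In the Python sketch below the function names are hypothetical, and the limits are written for the case in which the alternative variance exceeds the null variance, so that the divisor is positive.

```python
import math

def log_likelihood_ratio(sum_sq, m, var0, var1):
    """Z_m for the variance test; sum_sq is the sum of (x - mu)^2 over m values."""
    return -0.5 * m * math.log(var1 / var0) - 0.5 * (1 / var1 - 1 / var0) * sum_sq

def continuation_limits(m, var0, var1, alpha, beta):
    """Lower and upper limits on the sum of squares; assumes var1 > var0."""
    if var1 <= var0:
        raise ValueError("this form of the limits assumes var1 > var0")
    scale = 1 / var0 - 1 / var1
    shift = m * math.log(var1 / var0)
    lower = (2 * math.log(beta / (1 - alpha)) + shift) / scale
    upper = (2 * math.log((1 - beta) / alpha) + shift) / scale
    return lower, upper

lower, upper = continuation_limits(10, 1.0, 4.0, 0.05, 0.05)
print(lower, upper)
```

Substituting either limit back into the log likelihood ratio reproduces the corresponding boundary, log{beta/(1 - alpha)} or log{(1 - beta)/alpha}.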


The efficiency of a sequential test 

34.22 In general, many different tests may be derived for given $\alpha$ and $\beta$, $\theta_0$ and $\theta_1$.
There is no point in comparing their power for given sample numbers because they
are arranged so as to have the same $\beta$-error. We may, however, define efficiency in
terms of sample size or ASN. The test with the smaller ASN may reasonably be said
to be the more efficient. Following Wald (1947) we shall prove that when end-effects
are negligible the SPR test is a most efficient test. More precisely, if $S'$ is a SPR test
and $S$ is some other test based on the sum of logarithms of identically distributed
variables,
$$ E_i(n \mid S) \geq E_i(n \mid S'), \qquad i = 0, 1, \tag{34.70} $$
where $E_i$ denotes the expected value of $n$ on hypothesis $H_i$.

Note first of all that if $u$ is any random variable, $u-E(u)$ is the value measured from
the mean, and
$$ \exp\{u-E(u)\} \geq 1+\{u-E(u)\}. $$
On taking expectations we have
$$ E[\exp\{u-E(u)\}] \geq 1, \tag{34.71} $$
which gives
$$ E(\exp u) \geq \exp\{E(u)\}. \tag{34.72} $$
We also have, from (34.42), for any closed sequential test based on the sums of type $Z_n$,
$$ E_i(n \mid S) = \frac{E_i(\log L_n \mid S)}{E_i(z)}. \tag{34.73} $$
If $E^*$ denotes the conditional expectation when $H_0$ is accepted, and $E^{**}$ the conditional
expectation when $H_1$ is accepted, $H_0$ being true in each case, we have, as at (34.22),
neglecting end-effects,
$$ E^*(L_n \mid S) = \frac{\beta}{1-\alpha} \tag{34.74} $$
and similarly, as at (34.21),
$$ E^{**}(L_n \mid S) = \frac{1-\beta}{\alpha}. \tag{34.75} $$
Hence
$$ E_0(n \mid S) = \frac{1}{E_0(z)}\{(1-\alpha)E^*(\log L_n \mid S)+\alpha E^{**}(\log L_n \mid S)\}. \tag{34.76} $$
In virtue of (34.72), (34.74) and (34.75) we then have, $E_0(z)$ being negative,
$$ E_0(n \mid S) \geq \frac{1}{E_0(z)}\left\{(1-\alpha)\log\frac{\beta}{1-\alpha}+\alpha\log\frac{1-\beta}{\alpha}\right\}, \tag{34.77} $$
and interchanging $H_0$ and $H_1$, $\alpha$ and $\beta$ in (34.77),
$$ E_1(n \mid S) \geq \frac{1}{E_1(z)}\left\{\beta\log\frac{\beta}{1-\alpha}+(1-\beta)\log\frac{1-\beta}{\alpha}\right\}. \tag{34.78} $$
When $S = S'$ these inequalities, as at (34.43), are replaced (neglecting end-effects) by
equalities. Hence (34.70).


Example 34.10 

One of the recommendations of the sequential method, as we have remarked, is
that for a given $(\alpha, \beta)$, it requires a smaller sample on the average than the method
employing a fixed sample size. General formulae comparing the two would be difficult
to derive, but we may illustrate the point on the testing of a mean in normal variation
(Example 34.8).

For fixed $n$ and $\alpha$ the test consists of finding a deviate $d$ such that
$$ \mathrm{Prob}\{\mu_0-d \leq \bar{x} \leq \mu_0+d \mid H_0\} = 1-\alpha, $$
$$ \mathrm{Prob}\{\mu_0-d \leq \bar{x} \leq \mu_0+d \mid H_1\} = \beta, $$
and, putting
$$ \lambda_0 = \sqrt{n}(d-\mu_0), \qquad \lambda_1 = \sqrt{n}(d-\mu_1), $$
we have
$$ n = \frac{(\lambda_1-\lambda_0)^2}{(\mu_0-\mu_1)^2}. \tag{34.79} $$
Given $\alpha$, $\beta$, $\mu_0$ and $\mu_1$, $n$ is determinable. Let us compare it with the ASN of a
SPR test. Taking the approximate formula (34.43), which is
$$ E_\mu(n) = \frac{1}{E_\mu(z)}[K(\mu)\log B+\{1-K(\mu)\}\log A], \tag{34.80} $$
we find, since
$$ E_1(z) = \tfrac12(\mu_0-\mu_1)^2 $$
and
$$ E_0(z) = -\tfrac12(\mu_0-\mu_1)^2, $$
$$ \frac{E_1(n)}{n} = \frac{2}{(\lambda_1-\lambda_0)^2}\{\beta\log B+(1-\beta)\log A\}. \tag{34.81} $$
Likewise we find
$$ \frac{E_0(n)}{n} = -\frac{2}{(\lambda_1-\lambda_0)^2}\{(1-\alpha)\log B+\alpha\log A\}. \tag{34.82} $$
Thus, for $\alpha = 0.01$, $\beta = 0.03$, $A = 97$, $B = 3/99$, and we find $\lambda_0 = 2.5758$, $\lambda_1 = -1.8808$.
The ratio $E_1(n)/n$ is then 0.43 and $E_0(n)/n = 0.55$. We thus require in the sequential
case, on the average, either 43 or 55 per cent of the fixed sample size needed to
attain the same performance.

It should be emphasized that a reduced sample size with a SPR test will only be
found, in general, when one of $H_0$ or $H_1$ is true, and not necessarily for other parameter
values. Guaranteed economy is therefore restricted to cases where the alternative
hypotheses can be specified beforehand. Cf. T. W. Anderson (1960).


Composite hypotheses 


34.23 Although we have considered the test of a simple $H_0$ against a simple $H_1$,
the OC and ASN functions are, in effect, calculated against a range of alternatives
and therefore give us the performance of the test for a simple $H_0$ against a composite $H_1$.
We now consider the case of a composite $H_0$. Suppose that $\theta$ may vary in some
domain $\Omega$. We require to test that it lies in some sub-domain $\omega_a$ against the alternatives
that it lies either in a rejection sub-domain $\omega_r$, or in a region of indifference $\Omega-\omega_a-\omega_r$
(which may be empty). We shall require of the errors two things: the probability
that an error of the first kind, $\alpha(\theta)$, which in general varies with $\theta$, shall not exceed some
fixed number $\alpha$ for all $\theta$ in $\omega_a$; and the probability that an error of the second kind,
$\beta(\theta)$, shall not exceed $\beta$ for all $\theta$ in $\omega_r$. Wherever our parameter point $\theta$ really lies,
then, we shall have upper limits to the errors, given by $\alpha$ and $\beta$.


34.24 Such a requirement, however, hardly constitutes a very effective criterion.
We are always on the safe side, but may be so far on the safe side in particular cases as
to lose a good deal of efficiency. Wald (1947) suggested that it might be better to
consider the average of $\alpha(\theta)$ over $\omega_a$ and of $\beta(\theta)$ over $\omega_r$ as reasonable criteria. This
raises the question as to what sort of average should be used. Wald defines two weighting
functions, $w_a(\theta)$ and $w_r(\theta)$ such that
$$ \int_{\omega_a} w_a(\theta)\,d\theta = 1, \qquad \int_{\omega_r} w_r(\theta)\,d\theta = 1, \tag{34.83} $$
and we then define
$$ \int_{\omega_a} w_a(\theta)\alpha(\theta)\,d\theta = \alpha, \tag{34.84} $$
$$ \int_{\omega_r} w_r(\theta)\beta(\theta)\,d\theta = \beta. \tag{34.85} $$

By these means we reduce the problem to one of testing simple hypotheses. In
fact, if we let
$$ L_{0m} = \int_{\omega_a} f(x_1, \theta)f(x_2, \theta)\dots f(x_m, \theta)\,w_a(\theta)\,d\theta, \tag{34.86} $$
$$ L_{1m} = \int_{\omega_r} f(x_1, \theta)f(x_2, \theta)\dots f(x_m, \theta)\,w_r(\theta)\,d\theta, \tag{34.87} $$
the likelihood ratio $L_{1m}/L_{0m}$ can be used in the ordinary way with errors $\alpha$ and $\beta$.
We may, if we like, regard (34.86) and (34.87) as the posterior probabilities of the sample
when $\theta$ itself has prior probabilities $w_a(\theta)$ and $w_r(\theta)$.


34.25 This procedure, of course, throws the problem into the form of finding or
choosing the weight functions $w_a(\theta)$ and $w_r(\theta)$. We are in the same position as the
probabilist wishing to apply Bayes' theorem. We may resolve it by some form of
Bayes' postulate, e.g. by assuming that $w_a(\theta) = 1$ everywhere in $\omega_a$. Another possibility
is to choose $w_a(\theta)$ and $w_r(\theta)$ so as to optimize some properties of the test.

For example, the choice of the test is made when we select $\alpha$ and $\beta$ (or, to our
approximation, $A$ and $B$) and the weight functions. Among all such tests there will
be maximum values of $\alpha(\theta)$ and $\beta(\theta)$. If we choose the weight functions so as to
minimize (max $\alpha$, max $\beta$), we have a test which, for given $A$ and $B$, has the lowest
possible bound to the average errors. If it is not possible to minimize the maxima of
$\alpha$ and $\beta$ simultaneously we may, perhaps, minimize the maximum of some function
such as their sum.
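
To make the weighted likelihood ratio concrete, the Python sketch below evaluates the two weighted likelihoods for 0/1 observations by the midpoint rule, with uniform weight functions over an acceptance interval and a rejection interval. The intervals, the uniform weights, the Bernoulli model and the function names are all illustrative assumptions, not taken from the text.

```python
def weighted_likelihood(xs, lo, hi, steps=2000):
    """Midpoint-rule integral of prod f(x_i, theta) * w(theta) over [lo, hi],
    with f Bernoulli and w(theta) = 1/(hi - lo) (uniform weight, an assumption)."""
    successes = sum(xs)
    failures = len(xs) - successes
    width = (hi - lo) / steps
    total = 0.0
    for j in range(steps):
        theta = lo + (j + 0.5) * width
        total += theta**successes * (1 - theta) ** failures
    return total * width / (hi - lo)

def weighted_ratio(xs, accept_region, reject_region):
    """Ratio of the rejection-region to the acceptance-region weighted likelihood."""
    return weighted_likelihood(xs, *reject_region) / weighted_likelihood(xs, *accept_region)

accept_region = (0.0, 0.1)   # hypothetical omega_a
reject_region = (0.2, 0.3)   # hypothetical omega_r
print(weighted_ratio([0] * 20, accept_region, reject_region))
print(weighted_ratio([1, 1, 0, 1, 1], accept_region, reject_region))
```

The ratio would then be compared with the boundaries A and B after each observation exactly as for simple hypotheses.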


A sequential t-test 

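34.26 A test proposed by Wald (1947) and, in a modified form, by other writers
sets out to test the mean $\mu$ of a normal distribution when the variance is unknown.
It is known as the sequential $t$-test because it deals with the same problem as
"Student's" $t$ in the fixed-sample case; but it does not free itself from the scale
parameter $\sigma$ in the same way and the name is, perhaps, somewhat misleading.

Specifically, we wish to test that, compared to some value $\mu_0$, the deviation $(\mu-\mu_0)/\sigma$
is small, say $< \delta$. The three sub-domains of 34.23 are then as follows:

$\omega_a$ consists of $(\mu_0, \sigma)$ for all $\sigma$;
$\omega_r$ consists of values for which $|\mu-\mu_0| \geq \sigma\delta$, for all $\sigma$;
$\Omega-\omega_a-\omega_r$ consists of values for which $0 < |\mu-\mu_0| < \sigma\delta$, for all $\sigma$.

We define weight functions for $\sigma$ as follows:
$$ v_{ac} = \frac{1}{c}, \quad 0 \leq \sigma \leq c; \qquad = 0 \text{ elsewhere.} \tag{34.88} $$
$$ v_{rc} = \frac{1}{2c}, \quad 0 \leq \sigma \leq c, \ \mu = \mu_0 \pm \delta\sigma; \qquad = 0 \text{ elsewhere.} \tag{34.89} $$
Then
$$ L_{1m} = \frac{1}{2c}\int_0^c \frac{1}{(2\pi)^{m/2}\sigma^m}\left[\exp\left\{-\frac{1}{2\sigma^2}\sum(x_i-\mu_0-\delta\sigma)^2\right\}+\exp\left\{-\frac{1}{2\sigma^2}\sum(x_i-\mu_0+\delta\sigma)^2\right\}\right]d\sigma, \tag{34.90} $$
$$ L_{0m} = \frac{1}{c}\int_0^c \frac{1}{(2\pi)^{m/2}\sigma^m}\exp\left\{-\frac{1}{2\sigma^2}\sum(x_i-\mu_0)^2\right\}d\sigma. \tag{34.91} $$
The limit of the ratio $L_{1m}/L_{0m}$ as $c$ tends to infinity then becomes
$$ \lim\frac{L_{1m}}{L_{0m}} = \frac{\tfrac12\displaystyle\int_0^\infty \sigma^{-m}\left[\exp\left\{-\frac{1}{2\sigma^2}\sum(x_i-\mu_0-\delta\sigma)^2\right\}+\exp\left\{-\frac{1}{2\sigma^2}\sum(x_i-\mu_0+\delta\sigma)^2\right\}\right]d\sigma}{\displaystyle\int_0^\infty \sigma^{-m}\exp\left\{-\frac{1}{2\sigma^2}\sum(x_i-\mu_0)^2\right\}d\sigma}. \tag{34.92} $$
This depends on the $x$'s, which are observed, and on $\mu_0$ and $\delta$, which are given, but not on
$\sigma$, which we have integrated out of the problem by the weight functions (34.88) and
(34.89). If we can evaluate the integrals in (34.92) we can apply this ratio to give a
sequential test.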


34.27 The rather arbitrary-looking weight functions are, in fact, such as to optimize
the test. To prove this we first of all establish (a) that $\alpha(\mu, \sigma)$ is constant in $\omega_a$;
(b) that $\beta(\mu, \sigma)$ is a function of $|(\mu-\mu_0)/\sigma|$ alone; and (c) that $\beta(\mu, \sigma)$ is monotonically
decreasing in $|(\mu-\mu_0)/\sigma|$.

If $\bar{x}$ is the sample mean and $S^2$ is the sum $\sum(x_i-\bar{x})^2$, the distribution of the ratio
$(\bar{x}-\mu_0)/S$ depends only on $(\mu-\mu_0)/\sigma$. If then we can show that (34.92) is a single-valued
function of $(\bar{x}-\mu_0)/S$, the properties (a) and (b) will follow, for $(\mu-\mu_0)/\sigma$ is zero
in $\omega_a$ and $\beta(\mu, \sigma)$ depends only on the distribution of (34.92). Now the numerator and
denominator of (34.92) are both homogeneous functions of $(x_i-\mu_0)$ of degree $-(m-1)$, as
may be verified by putting $x_i = \lambda x_i$, $\mu_0 = \lambda\mu_0$, $\sigma = \lambda\sigma$. Thus the ratio of (34.92) is
of degree zero. Further, it is a function of $\sum(x_i-\mu_0)^2$ and $\sum(x_i-\mu_0)$ only, and hence
we do not change it by putting $(x_i-\mu_0)/\sqrt{\sum(x_i-\mu_0)^2}$ for $x_i-\mu_0$. The ratio is, then,
a function only of $\sum(x_i-\mu_0)/\sqrt{\sum(x_i-\mu_0)^2}$, and is, in actual fact, a function of the
square of that quantity, and hence of
$$ \frac{(\bar{x}-\mu_0)^2}{\sum(x_i-\mu_0)^2} = \frac{(\bar{x}-\mu_0)^2}{n(\bar{x}-\mu_0)^2+S^2}. $$
It is therefore a single-valued function of $(\bar{x}-\mu_0)/S$.

To show that $\beta(\mu, \sigma)$ is monotonically decreasing in $|(\mu-\mu_0)/\sigma|$ it is sufficient to
show that the ratio (34.92) is a strictly increasing function of $|(\bar{x}-\mu_0)/S|$, or equivalently
of $(\bar{x}-\mu_0)^2/\sum(x_i-\mu_0)^2$. Now for fixed $\sum(x_i-\mu_0)^2$ the denominator of (34.92) is fixed
and the numerator is an increasing function of $(\bar{x}-\mu_0)^2$. Thus the whole ratio is
increasing in $(\bar{x}-\mu_0)^2$ for fixed $\sum(x_i-\mu_0)^2$ and the required result follows.


34.28 Under these conditions we can prove that the sequential $t$-test is optimal.
In fact, any test is optimal if (i) $\alpha(\theta)$ is constant in $\omega_a$; (ii) $\beta(\theta)$ is constant over the
boundary of $\omega_r$; and (iii) $\beta(\theta)$ does not exceed its boundary value for any $\theta$ inside $\omega_r$.

To prove this, let $v_a$ and $v_r$ be two weight functions obeying these conditions and
$w_a$, $w_r$ two other weight functions. Let $\alpha$, $\beta$ be the errors for the first set, $\alpha^*$, $\beta^*$ those
for the second. Then we have
$$ A = \frac{1-\beta}{\alpha}, \qquad B = \frac{\beta}{1-\alpha}, $$
and hence
$$ \int_{\omega_a}\alpha^*(\theta)w_a(\theta)\,d\theta = \alpha = \frac{1-B}{A-B}, \tag{34.93} $$
$$ \int_{\omega_r}\beta^*(\theta)w_r(\theta)\,d\theta = \beta = \frac{B(A-1)}{A-B}. \tag{34.94} $$
Thus, in $\omega_a$ the maximum of $\alpha^*(\theta)$ is not less than $(1-B)/(A-B)$, for the integral of
$w_a(\theta)$ over that region is unity. But if $v_a$ has constant $\alpha$ over that domain, its maximum
is equal to $(1-B)/(A-B)$. Hence
$$ \max\alpha^*(\theta) \geq \max\alpha(\theta) \text{ in } \omega_a. $$
Likewise, in w, the maximum of f/*(0) is attained somewhere outside w, and cannot 
exceed B(A—1)/(A—B), whereas for 6(0) the maximum value must be attained 
somewhere. Hence $\max\beta^*(\theta) \geq \max\beta(\theta)$ in $\omega_r$.


The result follows. The conditions we have considered are sufficient but by no means 
necessary. 


34.29 Some tables for the use of this test were provided by Armitage (1947). 
The integrals occurring in (34.92) are, in fact, expressible in terms of the confluent 
hypergeometric function and, in turn, in terms of the distribution of non-central t. 
Tables to facilitate sequential t-Tests (U.S.A. National Bureau of Standards, Applied 
Mathematics Series 7, 1951) is a co-operative work with an introduction by K. J. 
Arnold. See also later work by Armitage and others, most recently Myers et al. (1966). 

An alternative method of attack is given by D. R. Cox (1952a, b) for the case where 
the distribution of a set of sufficient statistics factorizes as at (23.118). A SPR test of 
6, can then be based on T,, free of the nuisance (often scale) parameters 6,, in the ordin- 
ary way. This approach also requires an invariance condition—cf. Hall et al. (1965). 
Hajnal (1961) develops a two-sample sequential $t$-test along these lines.

D. R. Cox (1963) gives a large-sample sequential test for any composite hypothesis, 
based on ML theory—cf. Exercise 34.21. 


Sequential estimation 

34.30 In testing hypotheses we usually fix the errors in advance and proceed 
with the sampling until acceptance or rejection is reached. We may also use the 
sequential process for estimation, but our problems are then more difficult to solve 
and may even be difficult to formulate. We draw a sequence of observations with the 
object of estimating some parameter of the parent; but in general it is not easy to 
discern which is the appropriate estimator, what biases are present, what are the 
sampling errors or what should be the rules determining the ending of the sampling 
process. The basic difficulty is that the sample number itself is a random variable. A
secondary nuisance is the end-effect to which we have already referred.(*)

(*) Lehmann and Stein (1950) considered the notion of "completeness" in the sequential
case, but general criteria are not easy to apply even in attribute sampling; cf. de Groot (1959).


34.31 We derived at (34.42) the result equivalent to
$$ E\{Z_n-nE(z)\} = 0. \tag{34.95} $$
Let us assume that absolute second moments exist, that the variance of each $z_i$ is
equal to $\sigma^2$, and that $E(n^2)$ exists. The variance of $\{Z_n-nE(z)\}$ may then be derived
(the proof is left to the reader as Exercise 34.5) as
$$ E\{Z_n-nE(z)\}^2 = \sigma^2E(n) \tag{34.96} $$
and it follows that, with $E(z) = \mu$ as before,
$$ \mu^2E(n^2) = \sigma^2E(n)+2\mu E(nZ_n)-E(Z_n^2). $$
If $Z_n$ and $n$ are uncorrelated, this simplifies to
$$ \mu^2\operatorname{var} n = \sigma^2E(n)-\operatorname{var} Z_n. \tag{34.97} $$
Results for higher moments have been obtained by Wolfowitz (1947); cf. Exercise
34.6. See also Chow et al. (1965).


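34.32 Now let
$$ Y_n = \sum_{i=1}^{n}\frac{\partial}{\partial\theta}\log f(x_i, \theta). \tag{34.98} $$
Then since, under regularity conditions,
$$ E\left(\frac{\partial\log f}{\partial\theta}\right) = 0, $$
we have, in virtue of relations like (34.95),
$$ E(Y_n) = 0, \tag{34.99} $$
$$ \operatorname{var} Y_n = E(Y_n^2) = E(n)\,E\left\{\left(\frac{\partial\log f}{\partial\theta}\right)^2\right\}. \tag{34.100} $$
If $t$ is an estimator of $\theta$ with bias $b(\theta)$, i.e. is such that
$$ E(t) = \theta+b(\theta), $$
we have, differentiating this equation,
$$ \operatorname{cov}(t, Y_n) = E(tY_n) = 1+b'(\theta). \tag{34.101} $$
Then, by the Cauchy-Schwarz inequality
$$ \operatorname{var} t\,E(Y_n^2) \geq \{1+b'(\theta)\}^2, $$
and hence, by (34.100),
$$ \operatorname{var} t \geq \frac{\{1+b'(\theta)\}^2}{E(n)\,E\{(\partial\log f/\partial\theta)^2\}}, \tag{34.102} $$
which is Wolfowitz's form of the lower bound to the variance in sequential estimation.
It consists simply of putting $E(n)$ for $n$ in the MVB (17.22). Wolfowitz (1947) also
gives an extension of the result to the simultaneous estimation of several parameters.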


Example 34.11 
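Consider the binomial with unit index
$$ f(x, \varpi) = \varpi^x(1-\varpi)^{1-x}, \qquad x = 0, 1. $$
We have
$$ \frac{\partial\log f}{\partial\varpi} = \frac{x}{\varpi}-\frac{1-x}{1-\varpi}, \qquad E\left(\frac{\partial\log f}{\partial\varpi}\right)^2 = \frac{1}{\varpi(1-\varpi)}. $$
If $p$ is an unbiassed estimator of $\varpi$ in a sample from this distribution, we then have
$$ \operatorname{var} p \geq \frac{\varpi(1-\varpi)}{E(n)}. \tag{34.103} $$
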

34.33 If $n$ is large the theory of sequential estimation is very much simplified in
virtue of a general result due to Anscombe (1949b, 1952, 1953). Simply stated, this
amounts to saying that for statistics where a Central Limit effect is present, the formulae
for standard errors are the same for sequential samples as for samples of fixed size.
We might argue this heuristically from (34.102). $n$ varies about its mean $n_0$ with
standard deviation of order $n_0^{-1/2}$ and thus formulae accurate to order $n^{-1}$ remain accurate
to that order if we use $n_0$ instead of $n$. More formally:

Let $\{Y_n\}$, $n = 1, 2, \dots$ be a sequence of random variables. Let there exist a real
number $\theta$, a sequence of positive numbers $\{w_n\}$, and a distribution function $F(x)$ such
that

(a) $Y_n$ converges to $\theta$ in the scale of $w_n$, namely
$$ P\left\{\frac{Y_n-\theta}{w_n} \leq x\right\} \to F(x) \text{ as } n \to \infty; \tag{34.104} $$

(b) $\{Y_n\}$ is uniformly continuous in probability, namely given (small) positive
$\varepsilon$ and $\eta$, there is a large $\nu$ and a small positive $c$ such that, for any $n > \nu$,
$$ P\left\{\frac{|Y_{n'}-Y_n|}{w_n} < \varepsilon \text{ for all } n' \text{ such that } |n'-n| < cn\right\} > 1-\eta. \tag{34.105} $$

Let $\{n_r\}$ be an increasing sequence of positive integers tending to infinity and $\{N_r\}$
be a sequence of random variables taking positive integral values such that $N_r/n_r \to 1$
in probability as $r \to \infty$. Then
$$ P\left\{\frac{Y_{N_r}-\theta}{w_{n_r}} \leq x\right\} \to F(x) \text{ as } r \to \infty \tag{34.106} $$
in all continuity points of $F(x)$.

The complexity of the enunciation and the proof are due to the features we have
already noticed: end-effects (represented by the relation between $N_r$ and $n_r$) and the
variation in $n_r$.

In fact, let (34.105) be satisfied with $\nu$ large enough so that for any $n_r > \nu$
$$ P\{|N_r-n_r| < cn_r\} > 1-\eta. \tag{34.107} $$
Consider the event
$$ E: \ |N_r-n_r| < cn_r \text{ and } |Y_{N_r}-Y_{n_r}| < \varepsilon w_{n_r}, $$
and the events
$$ A: \ |Y_{n'}-Y_{n_r}| < \varepsilon w_{n_r}, \text{ all } n' \text{ such that } |n'-n_r| < cn_r, $$
$$ B: \ |N_r-n_r| < cn_r. $$
Then
$$ P(E) \geq P\{A \text{ and } B\} = P(A)-P\{A \text{ and not-}B\} \geq P(A)-P(\text{not-}B) > 1-2\eta. \tag{34.108} $$
Also
$$ P\{Y_{N_r}-\theta \leq xw_{n_r}\} = P\{Y_{N_r}-\theta \leq xw_{n_r} \text{ and } E\}+P\{Y_{N_r}-\theta \leq xw_{n_r} \text{ and not-}E\}. $$
Thus, in virtue of the definition of $E$ we find
$$ P\{Y_{n_r}-\theta \leq (x-\varepsilon)w_{n_r}\}-2\eta < P\{Y_{N_r}-\theta \leq xw_{n_r}\} < P\{Y_{n_r}-\theta \leq (x+\varepsilon)w_{n_r}\}+2\eta, $$
and (34.106) follows. It is to be noted that the proof does not assume $N_r$ and $Y_n$ to
be independent.


34.34 To apply this result to sequential estimation, let $x_1, x_2, \dots$ be a sequence
of observations and $Y_n$ an estimator of a parameter $\theta$, $D_n$ an estimator of the scale
$w_n$ of $Y_n$. The sampling rule is: given some constant $k$, sample until the first occurring
$D_n \leq k$ and then calculate $Y_n$. We show that $Y_n$ is an estimator of $\theta$ with scale
asymptotically equal to $k$ if $k$ is small.

Let conditions (34.104) and (34.105) be satisfied and $\{k_r\}$ be a sequence of positive
numbers tending to zero. Let $\{N_r\}$ be the sequence of random variables such that
$N_r$ is the least integer $n$ for which $D_n \leq k_r$; and let $\{n_r\}$ be the sequence such that $n_r$
is the least $n$ for which $w_n \leq k_r$. We require two further conditions:

(c) $\{w_n\}$ converges monotonically to zero and $w_n/w_{n+1} \to 1$ as $n \to \infty$;
(d) $N_r$ is a random variable for all $r$ and $N_r/n_r \to 1$ in probability as $r \to \infty$.

Condition (c) implies that $w_{n_r}/k_r \to 1$ as $r \to \infty$. It then follows from our previous
result that
$$ P\left\{\frac{Y_{N_r}-\theta}{k_r} \leq x\right\} \to F(x) \text{ as } r \to \infty. \tag{34.109} $$


34.35 It may also be shown that if the x’s are independently and identically dis- 
tributed, the conditions (a) and (c)—which are easily verifiable—together imply con- 
dition (b) and the distribution of their sum tends to a distribution function. In 
particular, these conditions are satisfied for Maximum Likelihood estimators, for 
estimators based on means of some functions of the observations, and for quantiles. 


Example 34.12 
Consider the estimation of the mean $\mu$ of a normal distribution with unknown
variance $\sigma^2$. We require of the estimator a (small) variance $k^2$.

The obvious statistic is $Y_n = \bar{x}_n$. For fixed $n$ this has variance $\sigma^2/n$ estimated as
$$ D_n^2 = \frac{1}{n(n-1)}\sum_{i=1}^{n}(x_i-\bar{x})^2. \tag{34.110} $$
Conditions (a) and (c) are obviously satisfied and in virtue of the result quoted in
34.35 this entails the satisfaction of condition (b). To show that (d) holds, transform
by Helmert's transformation
$$ \xi_i = \frac{x_1+x_2+\dots+x_i-ix_{i+1}}{\sqrt{i(i+1)}}, \qquad i = 1, 2, \dots, n-1. $$
Then
$$ D_n^2 = \frac{1}{n(n-1)}\sum_{i=1}^{n-1}\xi_i^2. $$
By the Strong Law of Large Numbers, given $\varepsilon$, $\eta$, there is a $\nu$ such that
$$ P\left\{\left|\frac{1}{n-1}\sum_{i=1}^{n-1}\xi_i^2-\sigma^2\right| < \varepsilon\sigma^2 \text{ for all } n \geq \nu\right\} > 1-\eta. \tag{34.111} $$
If $k$ is small enough, the probability exceeds $1-\eta$ that $D_n > k$ for every $n$ in the range
$2 \leq n \leq \nu$. Thus, given $N > \nu$, (34.111) implies that
$$ \left|\frac{N}{\sigma^2/k^2}-1\right| < \varepsilon $$
with probability exceeding $1-\eta$. Hence, as $k$ tends to zero, condition (d) holds.
The rule is, then, that we select $k$ and proceed until $D_n \leq k$. The mean $\bar{x}$ then has
variance approximately equal to $k^2$.
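
The stopping rule of this example is easy to express in code. The Python sketch below keeps a running mean and sum of squared deviations (Welford's updating formulae, a choice made for this illustration) and stops at the first n for which the estimated standard error D_n does not exceed k. The function name and the parameters n_min and n_max are hypothetical; n_min is included only as a guard against stopping on a handful of observations.

```python
import random

def sample_until_precise(draw, k, n_min=2, n_max=1_000_000):
    """Draw observations until the estimated standard error D_n is at most k.

    draw: zero-argument function returning the next observation.
    Returns (n, sample mean, D_n) at the stopping point.
    """
    n = 0
    mean = 0.0
    m2 = 0.0                      # running sum of squared deviations from the mean
    while n < n_max:
        x = draw()
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= max(n_min, 2):
            d = (m2 / (n * (n - 1))) ** 0.5   # square root of D_n^2
            if d <= k:
                return n, mean, d
    raise RuntimeError("n_max reached before the required precision")

rng = random.Random(0)
n, mean, d = sample_until_precise(lambda: rng.gauss(5.0, 2.0), k=0.1, n_min=30)
print(n, mean, d)
```

With a standard deviation of 2 and k = 0.1 the stopping number would be expected to lie in the neighbourhood of 400, the ratio of the variance to k squared.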


Example 34.13 


Consider the Poisson distribution with parameter equal to $\lambda$. If we proceed until
the variance of the mean, estimated as $\bar{x}/n$, is less than $k^2$, we have an estimator $\bar{x}$ of $\lambda$
with variance $k^2$. This is equivalent to proceeding until the number of successes falls
below $k^2n^2$. But we should not use this result for small $n$.

On the other hand, suppose we wanted to specify in advance not the variance but
the coefficient of variation, say $l$. The method would then fail. It would propose
that we proceed until $\bar{x}/\sqrt{(\bar{x}/n)}$ is less than $l$, i.e. until $n\bar{x} < l^2$ or the sum of observations
falls below $l^2$. But the sum must ultimately exceed any finite number. This
is related to the result noted in Example 34.1 where we saw that for sequential sampling
of rare attributes the coefficient of variation is approximately constant.


Stein’s double-sampling method 

34.36 At the end of Example 23.7, we observed that for fixed $n$ no similar test of
the mean $\mu$ of a normal population with unknown variance $\sigma^2$ could have power
independent of $\sigma^2$. This implies (cf. 23.26) that no confidence interval of pre-assigned
length can be found for $\mu$. However, if we use a sequential method, these statements
are no longer true, as Stein (1945) pointed out.


34.37 We consider a normal population with mean $\mu$ and variance $\sigma^2$ and require
to estimate $\mu$ with confidence coefficient $1-\alpha$, the length of the confidence-interval
being $l$. We choose first of all a sample of fixed size $n_0$, and then a further sample
$n-n_0$ where $n$ now depends on the observations in the first sample.

Take a "Student's" $t$-variable with $n_0-1$ degrees of freedom, and let the probability
that it lies in the range $-t_\alpha$ to $t_\alpha$ be $1-\alpha$. Define
$$ \sqrt{z} = \frac{l}{2t_\alpha}. \tag{34.112} $$
Let $s^2$ be the estimated variance of the sample of $n_0$ values, i.e.,
$$ s^2 = \frac{1}{n_0-1}\sum_{i=1}^{n_0}(x_i-\bar{x})^2. \tag{34.113} $$
We determine $n$ by
$$ n = \max\{n_0, 1+[s^2/z]\}, \tag{34.114} $$
where $[s^2/z]$ means the greatest integer less than $s^2/z$.

Consider the $n$ observations altogether, and let them have mean $Y_n$. Then $Y_n$ is
distributed independently of $s$ and consequently $(Y_n-\mu)\sqrt{n}$ is independent of $s$; and
hence $(Y_n-\mu)\sqrt{n}/s$ is distributed as $t$ with $n_0-1$ d.f. Hence
$$ P\left\{\left|\frac{(Y_n-\mu)\sqrt{n}}{s}\right| \leq t_\alpha\right\} = 1-\alpha, $$
or
$$ P\left\{Y_n-\frac{st_\alpha}{\sqrt{n}} \leq \mu \leq Y_n+\frac{st_\alpha}{\sqrt{n}}\right\} = 1-\alpha, $$
or
$$ P\{Y_n-\tfrac12 l \leq \mu \leq Y_n+\tfrac12 l\} \geq 1-\alpha. \tag{34.115} $$
The appearance of the inequality in (34.115) is due to the end-effect that $s^2/z$ may not
be integral, which in general is small, so that the limits given by $Y_n \pm \tfrac12 l$ are close to
the exact limits for confidence coefficient $1-\alpha$. In point of fact we can, by a device
suggested by Stein, obtain exact limits, though the procedure entails rejecting observations
and is probably not worth while in practice.

Seelbinder (1953) and Moshman (1958) discuss the optimum choice of first sample
size in Stein's method. Bhattacharjee (1965) shows that Stein's procedure is more
sensitive to non-normality than "Student's" $t$-test (cf. 31.3), and that, as we should
expect, non-normality re-introduces the dependence of the interval length (and corresponding
test power) upon $\sigma^2$.
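
The determination of the total sample size in Stein's method can be sketched in Python as follows. The function name is hypothetical, and the t-value is passed in as an argument (here 2.262, the two-sided 5 per cent point for 9 degrees of freedom) so that the sketch needs only the standard library.

```python
import math
import statistics

def stein_total_sample_size(first_sample, length, t_alpha):
    """Total sample size n = max{n0, 1 + [s^2/z]} for a confidence interval of given length."""
    n0 = len(first_sample)
    s2 = statistics.variance(first_sample)     # estimated variance, divisor n0 - 1
    z = (length / (2 * t_alpha)) ** 2          # z defined through sqrt(z) = l / (2 t_alpha)
    # With [y] the greatest integer less than y, 1 + [y] equals ceil(y) for y >= 0.
    return max(n0, math.ceil(s2 / z))

first_sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(stein_total_sample_size(first_sample, length=2.0, t_alpha=2.262))
```

For this first sample, whose estimated variance is about 9.17, an interval of length 2 calls for 47 observations in all, that is, 37 more than the 10 already taken.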


34.38 Chapman (1950) extended Stein’s method to testing the ratio of the means 
of two normal variables, the test being independent of both variances. It depends, 
however, on the distribution of the difference of two t-variables, for which Chapman 
provides some tables. D.R. Cox (1952c) considered the problem of estimation in double 
sampling, obtaining a number of asymptotic results. He also considered corrections 
to the single and double sampling results to improve the approximations of asymptotic 
theory. A. Birnbaum and Healy (1960) discuss a general class of double sampling pro- 
cedures to attain prescribed variance. Graybill and Connell (1964) give a double sampling 
procedure for estimating a normal variance within a fixed interval. Goldman and Zeigler 
(1966) compare different methods in estimating a normal mean or variance: Stein’s is 
best for the mean. 


Distribution-free tests 

34.39 By the use of order-statistics we can reduce many procedures to the binomial
case. Consider, for example, the testing of the hypothesis that the mean of a normal
distribution is greater than $\mu_0$ (a one-sided test). Replace the mean by the median
and variate values by a score of, say, $+$ if the sample value falls above it and $-$ in the
opposite case. On the hypothesis $H_0: \mu = \mu_0$ these signs will be distributed binomially
with $\varpi = \tfrac12$. On the hypothesis $H_1: \mu = \mu_0+k\sigma$ the probability of a positive sign is
$$ \varpi_1 = \frac{1}{\sqrt{2\pi}}\int_{-k}^{\infty}\exp(-\tfrac12 x^2)\,dx. \tag{34.116} $$
We may then set up a SPR test of $\varpi_0$ against $\varpi_1$ in the usual manner. This will have
a type I error $\alpha$ and a type II error $\beta$ of accepting $H_0$ when $H_1$ is true; and this type II
error will be $\leq \beta$ when $\mu-\mu_0 > k\sigma$. This is, in fact, a sequential form of the Sign
test of 32.2-7.

Tests of this kind are often remarkably efficient, and the sacrifice of efficiency
may be well worth while for the simplicity of application. Armitage (1947) compared
this particular test with Wald's $t$-test and came to the conclusion that, as judged by
sample number, the optimum test is not markedly superior to the Sign test.
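
A sequential sign test of this kind can be sketched in Python as below. The function name and the decision labels are hypothetical; the boundaries A = (1 - beta)/alpha and B = beta/(1 - alpha) are the usual approximations, assumed here; and a value exactly equal to the hypothetical median is scored as a minus, which is a convention adopted only for the sketch.

```python
import math

def sequential_sign_test(values, mu0, k, alpha, beta):
    """SPR test of proportion 1/2 against Phi(k) applied to the signs of x - mu0."""
    p0 = 0.5
    p1 = 0.5 * (1 + math.erf(k / math.sqrt(2)))   # probability of a plus sign under H1
    log_a = math.log((1 - beta) / alpha)          # assumed approximate boundaries
    log_b = math.log(beta / (1 - alpha))
    z = 0.0
    n = 0
    for n, x in enumerate(values, start=1):
        if x > mu0:
            z += math.log(p1 / p0)
        else:
            z += math.log((1 - p1) / (1 - p0))
        if z >= log_a:
            return "accept H1", n
        if z <= log_b:
            return "accept H0", n
    return "continue", n

print(sequential_sign_test([1.0] * 20, mu0=0.0, k=1.0, alpha=0.05, beta=0.05))
print(sequential_sign_test([-1.0] * 20, mu0=0.0, k=1.0, alpha=0.05, beta=0.05))
```

With k = 1 and both errors at 0.05, six plus signs in succession are enough to accept the alternative and three minus signs in succession are enough to accept the null hypothesis.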


34.40 Jackson (1960) has provided a bibliography on sequential analysis, classified 
by topic. Johnson (1961) gives a useful review of the subject. 


Decision functions 


34.41 In closing this chapter, we may refer briefly to a development of Wald’s 
idéas on sequential procedures towards a general theory of decisions. A situation is 
envisaged in which, at some stage of the sampling at least, one has to take a decision, 
e.g. to accept a hypothesis, or to continue sampling. The consequences of these 
decisions are assumed to be known, and it is further assumed that they can be evaluated 
numerically. The problem is then to decide on optimum decision rules. Various 
possible principles can be adopted, e.g. to act so as to maximize expected gain or to 
minimize expected loss. Some writers have gone so far as to argue that all estimation 
and hypothesis-testing are, in fact, decision-making operations. We emphatically 
disagree, both that all statistical inquiry emerges in decision and that the consequences 
of many decisions can be evaluated numerically. And even in cases where both points 
may be conceded, it appears to us questionable whether some of the principles which 
have been proposed are such as a reasonable person would use in practice. That 
statistics is solely the science of decision-making seems to us a patent exaggeration. 
But, like some questions in probability, this is a matter on which each individual has 
to make up his own mind—with such aid from the theory of decision functions as he 
can get. 

The leading expositions of this theory are the pioneer work by Wald himself (1950) 
and the more recent book by Blackwell and Girshick (1954). 


EXERCISES 


34.1 In Example 34.1, show by use of Exercise 9.13 that (34.3) implies the biasedness 
of m/n for $\varpi$. 


34.2 Referring to Example 34.6, sketch the OC curve for a binomial with $\alpha = 0.01$,
$\beta = 0.03$, $\varpi_0 = 0.1$, $\varpi_1 = 0.2$. (The curve is half a bell-shaped curve with a maximum
at $\varpi = 0$ and zero at $\varpi = 1$. Six points are enough to give its general shape.) Similarly,
sketch the ASN curve for the same binomial.


34.3 Two samples, each of size $n$, are drawn from populations, $P_1$ and $P_2$, with
proportions $\varpi_1$ and $\varpi_2$ of an attribute. They are paired off in order of occurrence. $t_1$ is the
number of pairs in which there is a success from $P_1$ and a failure from $P_2$; $t_2$ is the number
of pairs in which there is a failure from $P_1$ and a success from $P_2$. Show that in the
(conditional) set of such pairs the probability of a member of $t_2$ is

$$\varpi = (1-\varpi_1)\varpi_2 / \{\varpi_1(1-\varpi_2) + \varpi_2(1-\varpi_1)\}.$$

Considering this as an ordinary binomial in the set of $t = t_1 + t_2$ values, show how to
test the hypothesis that $\varpi_1 > \varpi_2$ by testing $\varpi = \tfrac12$. Hence derive a sequential test for
$\varpi_1 > \varpi_2$.

If
$$u = \frac{\varpi_2(1-\varpi_1)}{\varpi_1(1-\varpi_2)},$$
show that $\varpi = u/(1+u)$ and hence derive the following acceptance and rejection numbers:

$$a_t = \frac{\log\dfrac{\beta}{1-\alpha}}{\log u_1 - \log u_0} + t\,\frac{\log\dfrac{1+u_1}{1+u_0}}{\log u_1 - \log u_0},$$

$$r_t = \frac{\log\dfrac{1-\beta}{\alpha}}{\log u_1 - \log u_0} + t\,\frac{\log\dfrac{1+u_1}{1+u_0}}{\log u_1 - \log u_0},$$

where $u_i$ is the value of $u$ corresponding to $H_i$ ($i = 0, 1$).
(Wald, 1947)
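
A small numerical sketch may help in checking a derivation of acceptance and rejection numbers for this exercise. It assumes the standard binomial SPR boundaries with $\varpi_i = u_i/(1+u_i)$, which give two parallel lines in $t$ with common slope $\log\{(1+u_1)/(1+u_0)\}/(\log u_1 - \log u_0)$ and intercepts $\log\{\beta/(1-\alpha)\}/(\log u_1 - \log u_0)$ and $\log\{(1-\beta)/\alpha\}/(\log u_1 - \log u_0)$. The function name `double_dichotomy_limits` is hypothetical.

```python
import math


def double_dichotomy_limits(t, u0, u1, alpha, beta):
    """Acceptance and rejection numbers after t informative pairs.

    Assumes Wald's approximate boundaries for a binomial SPR test
    with success probabilities u0/(1+u0) and u1/(1+u1).
    """
    denom = math.log(u1) - math.log(u0)
    slope = math.log((1 + u1) / (1 + u0)) / denom
    accept = math.log(beta / (1 - alpha)) / denom + t * slope
    reject = math.log((1 - beta) / alpha) / denom + t * slope
    return accept, reject
```

The common slope always lies between the two binomial probabilities $u_0/(1+u_0)$ and $u_1/(1+u_1)$, which is a quick sanity check on any hand derivation.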


34.4 Referring to the function $h \neq 0$ of 34.14 show that if $z$ is a random variable
such that $E(z)$ exists and is not zero; if there exists a positive $\delta$ such that $P(e^z < 1-\delta) > 0$
and $P(e^z > 1+\delta) > 0$; and if for any real $h$, $E(\exp hz) = g(h)$ exists, then

$$\lim_{h \to \infty} g(h) = \infty = \lim_{h \to -\infty} g(h)$$

and that $g''(h) > 0$ for all real values of $h$. Hence show that $g(h)$ is strictly decreasing
over the interval $(-\infty, h^*)$ and strictly increasing over $(h^*, \infty)$, where $h^*$ is the value
for which $g(h)$ is a minimum. Hence show that there exists at most one $h \neq 0$ for which
$E(\exp hz) = 1$.
(Wald, 1947)


34.5 In 34.31, deduce the expressions (34.96—7). 
(cf. Johnson, 1959b) 


34.6 In Exercise 34.5, show that the third moment of $Z_n - n\mu$ is

$$E(Z_n - n\mu)^3 = \mu_3 E(n) + 3\sigma^2 E\{n(Z_n - n\mu)\},$$

where $\mu_3$ is the third moment of $z$.
(Wolfowitz, 1947)


34.7 If $z$ is defined as at (34.19), let $t$ be a complex variable such that $E(\exp zt) = \phi(t)$
exists in a certain part of the complex plane. Show that

$$E[\{\exp(t Z_n)\}\{\phi(t)\}^{-n}] = 1$$

for any point where $|\phi(t)| \ge 1$.
(Wald, 1947)


34.8 Putting $t = h$ in the foregoing exercise show that, if $E_b$ refers to expectation
under the restriction that $Z_n \le -b$ and $E_a$ to the restriction $Z_n \ge a$, then

$$K(h)\,E_b \exp(h Z_n) + \{1 - K(h)\}\,E_a \exp(h Z_n) = 1,$$

where $K$ is the OC. Hence, neglecting end-effects, show that

$$K(h) = \frac{e^{h(a+b)} - e^{hb}}{e^{h(a+b)} - 1}, \quad h \neq 0,$$

$$K(h) = \frac{a}{a+b}, \quad h = 0.$$

(Girshick, 1946)
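
As a numerical check on this exercise, the sketch below evaluates an OC function of the form $K(h) = \{e^{h(a+b)} - e^{hb}\}/\{e^{h(a+b)} - 1\}$ for $h \neq 0$, with the limiting value $a/(a+b)$ at $h = 0$, and can be used to confirm that it satisfies $K e^{-hb} + (1-K)e^{ha} = 1$ when end-effects are neglected. The function name `oc` and the tolerance used to detect $h = 0$ are assumptions made for illustration.

```python
import math


def oc(h, a, b):
    """OC function K(h) for boundaries at -b and a, neglecting end-effects."""
    if abs(h) < 1e-12:
        # limiting value as h -> 0
        return a / (a + b)
    num = math.exp(h * (a + b)) - math.exp(h * b)
    den = math.exp(h * (a + b)) - 1.0
    return num / den
```

For example, with `a = 2` and `b = 3` the value at `h = 0` is 0.4, and values at small non-zero `h` approach it continuously.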
34.9 Differentiating the identity of Exercise 34.7 with respect to $t$ and putting $t = 0$,
show that

$$E(n) = \frac{a\{1 - K(h)\} - bK(h)}{E(z)}$$

and hence derive equation (34.43).

(Girshick, 1946)


34.10 Assuming, as in the previous exercise, that the identity is differentiable, 
derive the results of Exercises 34.7 and 34.8. 


34.11 In the identity of Exercise 34.7, put

$$-\log \phi(t) = \tau$$

where $\tau$ is purely imaginary. Show that if $\phi(t)$ is not singular at $t = 0$ and $t = h$, this
equation has two roots $t_1(\tau)$ and $t_2(\tau)$ for sufficiently small values of $\tau$. In the manner
of Exercise 34.8, show that the characteristic function of $n$ is given asymptotically by

$$E(e^{n\tau}) = \frac{A^{t_2} - A^{t_1} + B^{t_1} - B^{t_2}}{B^{t_1}A^{t_2} - A^{t_1}B^{t_2}}.$$

(Wald, 1947)


34.12 In the case when $z$ is normal with mean $\mu$ and variance $\sigma^2$, show that $t_1$ and $t_2$
in Exercise 34.11 are

$$t_1 = -\frac{\mu}{\sigma^2} + \frac{1}{\sigma^2}(\mu^2 - 2\sigma^2\tau)^{1/2},$$

$$t_2 = -\frac{\mu}{\sigma^2} - \frac{1}{\sigma^2}(\mu^2 - 2\sigma^2\tau)^{1/2},$$

where the sign of the radical is determined so that the real part of $(\mu^2 - 2\sigma^2\tau)^{1/2}$ is positive.
In the limiting case $B = 0$, $A$ finite (when of necessity $E(z) > 0$ if $E(n)$ is to exist),
show that the c.f. is
$$A^{-t_1}$$
and in the case $B$ finite, $A = \infty$ (when $E(z) < 0$), show that the c.f. is
$$B^{-t_2}.$$
(Wald, 1947)
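
The roots in the normal case can be checked numerically. For normal $z$, $\log\phi(t) = \mu t + \tfrac12\sigma^2 t^2$, so $-\log\phi(t) = \tau$ is the quadratic $\tfrac12\sigma^2 t^2 + \mu t + \tau = 0$. The sketch below solves it with complex arithmetic; the function name `normal_roots` is hypothetical, and `cmath.sqrt` is relied on to return the root with non-negative real part.

```python
import cmath


def normal_roots(mu, sigma2, tau):
    """Roots t1, t2 of (1/2)*sigma2*t**2 + mu*t + tau = 0."""
    rad = cmath.sqrt(mu * mu - 2.0 * sigma2 * tau)  # principal square root
    t1 = (-mu + rad) / sigma2
    t2 = (-mu - rad) / sigma2
    return t1, t2
```

At $\tau = 0$ and $\mu > 0$ the first root is $t_1 = 0$, so a c.f. of the form $A^{-t_1}$ takes the value 1 there, as a characteristic function must.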


34.13 In the first of the two limiting cases of the previous exercise, show that the
distribution of $m = \mu^2 n/2\sigma^2$ is given by

$$dF(m) = \frac{c}{2\Gamma(\tfrac12)\,m^{3/2}} \exp\left(-\frac{c^2}{4m} - m + c\right) dm, \qquad 0 \le m < \infty,$$

where $c = \mu \log A/\sigma^2$.
For large $c$ show that $2m/c$ is approximately normal with unit mean and variance $1/c$.

(Wald, 1947, who also shows that when A, B are finite the distribution
of n is the weighted sum of a number of variables of the
above type.)


34.14 Values of $u$ are observed from the exponential distribution

$$dF = \lambda e^{-\lambda u}\,du, \qquad 0 \le u < \infty.$$

Show that a sequential test of $\lambda = \lambda_0$ against $\lambda = \lambda_1$ is given by

$$k_1 + (\lambda_1 - \lambda_0)\sum_{j=1}^{n} u_j \le n \log(\lambda_1/\lambda_0) \le k_2 + (\lambda_1 - \lambda_0)\sum_{j=1}^{n} u_j,$$

where $k_1$ and $k_2$ are constants.
Compare this with the test of Exercise 34.3 in the limiting case when $\varpi_1$ and $\varpi_2$ tend
to zero so that $\varpi_1 t = \lambda_0$ and $\varpi_2 t = \lambda_1$ remain finite.
(Anscombe and Page, 1954)
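
A minimal sketch of such a test follows. It rests on the log likelihood ratio after $n$ observations, $n\log(\lambda_1/\lambda_0) - (\lambda_1 - \lambda_0)\sum u_j$, and assumes that the two constants are Wald's approximate boundaries $\log\{\beta/(1-\alpha)\}$ and $\log\{(1-\beta)/\alpha\}$; that choice, and the function name `exponential_sprt`, are illustrative assumptions rather than part of the exercise.

```python
import math


def exponential_sprt(us, lam0, lam1, alpha, beta):
    """SPR test of lambda = lam0 against lambda = lam1 for exponential data."""
    log_a = math.log((1 - beta) / alpha)   # upper boundary: accept H1
    log_b = math.log(beta / (1 - alpha))   # lower boundary: accept H0
    total = 0.0                            # running sum of the observations
    n = 0
    for n, u in enumerate(us, start=1):
        total += u
        llr = n * math.log(lam1 / lam0) - (lam1 - lam0) * total
        if llr >= log_a:
            return "accept H1", n
        if llr <= log_b:
            return "accept H0", n
    return "continue", n
```

Sampling continues exactly while $n\log(\lambda_1/\lambda_0)$ lies between the two lines in $\sum u_j$, so only the running total of the observations need be stored.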


34.15 It is required to estimate a parameter $\theta$ with a small variance $a(\theta)/\lambda$ when
$\lambda$ tends to infinity. If $t_m$ is an unbiassed estimator in samples of fixed size $m$ with
variance $v(\theta)/m$; if $\gamma_1(t_m) = O(m^{-1/2})$ and $\gamma_2(t_m) = O(m^{-1})$; and if $a(t_m)$ and $b(t_m)$ can
be expanded in series to give asymptotic means and standard errors, consider the double
sampling rule:
(a) Take a sample of size $N\lambda$ and let $t_1$ be the estimate of $\theta$ from it.
(b) Take a second sample of size $\max\{0, [\{n_0(t_1) - N\}\lambda]\}$ where $n_0(t_1) = v(t_1)/a(t_1)$.
Let $t_2$ be the estimate of $\theta$ from the second sample.
(c) Let $t = \dfrac{N t_1 + \{n_0(t_1) - N\}\,t_2}{n_0(t_1)}$ if $n_0(t_1) \ge N$.
(d) Assume that $N < n_0(\theta)$ and the distribution of $m_0(t_1) = 1/n_0(t_1)$ is such that the
event $n_0(t_1) < N$ may be ignored.
Show that under this rule

$$E(t) = \theta + O(\lambda^{-1}),$$
$$\operatorname{var} t = a(\theta)\lambda^{-1}\{1 + O(\lambda^{-1})\}.$$

(D. R. Cox, 1952c)


34.16 In the previous exercise, take the same procedure except that m)(t,) is re- 


placed by 
b(t 
n (t,) = To dt “+ 8; 
Show that 
E(t) = 0+, (6) v(0)A-14+ O (A-*). 
Put 


ai 


t—m(t)uv(t)A if N < n(t,) 
= 0 otherwise, 
and hence show that t’ has bias O(A~-?). 
Show further that if we put 


: (8) = 19 (8) v (8) {229 (8) me) (9) yx (8) v-? (8) + mg? (8) + 2g (8) my’ (6) +g’ (6)/(2N)}, 
then 
var t’ = a(0)4-1+ O(A-). 
(D. R. Cox, 1952c) 


34.17 Applying Exercise 34.15 to the binomial distribution, with 
(1-20) 


a(@) =a@’, v(m) = a(i-o), y1(@) = {ad—o)}’ 


show that the total sample size is 
pea 
is al, t,(1 —?,) aNt, 
at 


and the estimator t’ = — 
Thus N should be chosen as large as possible, provided that it does not exceed 
$(1-\varpi)/(a\varpi)$. (D. R. Cox, 1952c) 


34.18 Referring to Example 34.9, show that 


«oO S) AG) G5) 


where /1 is given by 
of 22 fee Se x 2 
Pei Sees Fe ee ae 


provided that the expression in brackets on the right is positive. Hence show how to 
draw the OC curve. (Wald, 1947) 


34.19 In the previous exercise derive the expression for the ASN 


K(o) {ho == h, } + hy where yY — log c/o) / (3 -3) < (Wald, 1947) 
0 ] 


o?—y 


34.20 Justify the statement in the last sentence of Example 34.9, giving a test of 
normal variances when the parent mean is unknown. (Girshick, 1946) 


34.21 In 34.10, f(x, 6) is replaced by f(x, 6, 6) where ¢ is a nuisance parameter, so 
that Hy and H, are composite. (6n, dn) is the ML estimator of (0, 9) after n observations. 
If 0) and 6, differ from the true value of 6 by amounts of order n—?, show that a SPR test 
based on 


tn($) = log Ln (x, 61, 6) —log Ln (x, 9, 4) 
is asymptotically equivalent to (34.18) for the simple hypothesis when ¢ is known, if and 
only if 
1 0? log Ln (x, 9, $) 
n 00 0¢ 


— 0 


in probability, i.e. if 6 and ¢ are asymptotically independent. In any case, show that 


= 2 07 log Ln 
tn (P)~ (81 — 90) {8 — (89 + 93) } Bf = 002 , 


so that Tn = n{@—1(0,+0,)} may be used as a test statistic with mean and variance at 
once obtained from those of 6. (D. R. Cox, 1963) 


34.22 For a sample from the distribution f(x | 0) = g(x)/h(0), a<x<06, show that 
the SPR test of Hy): 0 = 6) against H, : 6 = 0, (0<6)<6,) has « = 0, and has the form:— 
accept Hy, if xn >0),n=1,2,... 3 accept Hy if xn <6)and (6,/6,)"<f; continue sampling 
otherwise. 

Show directly that when H, holds, the sample size is constant at my) = log B/log (6)/6;), 


6 
and that when H, holds the ASN is exactly ( 1— A) i ( _ = if m) is an integer. Verify 
af 


these formulae for the ASN from (34.43). 


The frequency function of the normal distribution 
The distribution function of the normal distribution 
Quantiles of the d.f. of $\chi^2$ 
The distribution function of $\chi^2$ for one degree of freedom, $0 \le \chi^2 \le 1$ 
Quantiles of the d.f. of $t$ 
5 per cent points of $z$ 
5 per cent points of $F$ 
1 per cent points of $z$ 
1 per cent points of $F$ 
Appendix Table 1 Frequency function of the normal distribution $y = \frac{1}{\sqrt{2\pi}} e^{-\frac12 x^2}$ with
first and second differences

 x     y        Δ¹(-)   Δ²      x     y        Δ¹(-)   Δ²
0.0   0.39894    199   -392    2.5   0.01753    395    +79
0.1   0.39695    591   -374    2.6   0.01358    316    +66
0.2   0.39104    965   -347    2.7   0.01042    250    +53
0.3   0.38139   1312   -308    2.8   0.00792    197    +45
0.4   0.36827   1620   -265    2.9   0.00595    152    +36
0.5   0.35207   1885   -212    3.0   0.00443    116    +27
0.6   0.33322   2097   -159    3.1   0.00327     89    +23
0.7   0.31225   2256   -104    3.2   0.00238     66    +17
0.8   0.28969   2360    -52    3.3   0.00172     49    +13
0.9   0.26609   2412      0    3.4   0.00123     36    +10
1.0   0.24197   2412    +46    3.5   0.00087     26     +7
1.1   0.21785   2366    +84    3.6   0.00061     19     +6
1.2   0.19419   2282   +118    3.7   0.00042     13     +4
1.3   0.17137   2164   +143    3.8   0.00029      9     +2
1.4   0.14973   2021   +161    3.9   0.00020      7     +3
1.5   0.12952   1860   +173    4.0   0.00013      4     +1
1.6   0.11092   1687   +177    4.1   0.00009      3      -
1.7   0.09405   1510   +177    4.2   0.00006      2      -
1.8   0.07895   1333   +170    4.3   0.00004      2      -
1.9   0.06562   1163   +162    4.4   0.00002      -      -
2.0   0.05399   1001   +150    4.5   0.00002      -      -
2.1   0.04398    851   +137    4.6   0.00001      -      -
2.2   0.03547    714   +120    4.7   0.00001      -      -
2.3   0.02833    594   +108    4.8   0.00000      -      -
2.4   0.02239    486    +91


Appendix Table 2 Distribution function of the normal distribution

The table shows the area under the curve $y = (2\pi)^{-1/2} e^{-\frac12 x^2}$ lying to the left of specified deviates x;
e.g. the area corresponding to a deviate 1.86 (= 1.5 + 0.36) is 0.9686.

Deviate  0.0+   0.5+   1.0+   1.5+   2.0+   2.5+    3.0+
 0.00    5000   6915   8413   9332   9772   9²379    …
 0.01    5040   6950   8438   9345   9778   9²396    …
 0.02    5080   6985   8461   9357   9783   9²413    …
 0.03    5120   7019   8485   9370   9788   9²430    …
 0.04    5160   7054   8508   9382   9793   9²446    …
 0.05    5199   7088   8531   9394   9798   9²461    …
 0.06    5239   7123   8554   9406   9803   9²477    …
 0.07    5279   7157   8577   9418   9808   9²492    …
 0.08    5319   7190   8599   9429   9812   9²506    …
 0.09    5359   7224   8621   9441   9817   9²520    …
 0.10    5398   7257   8643   9452   9821   9²534    …
 0.11    5438   7291   8665   9463   9826   9²547    …
 0.12    5478   7324   8686   9474   9830   9²560    …
 0.13    5517   7357   8708   9484   9834   9²573    …
 0.14    5557   7389   8729   9495   9838   9²585    …
 0.15    5596   7422   8749   9505   9842   9²598    …
 0.16    5636   7454   8770   9515   9846   9²609    …
 0.17    5675   7486   8790   9525   9850   9²621    …
 0.18    5714   7517   8810   9535   9854   9²632    …
 0.19    5753   7549   8830   9545   9857   9²643    …
 0.20    5793   7580   8849   9554   9861   9²653    …
 0.21    5832   7611   8869   9564   9864   9²664    …
 0.22    5871   7642   8888   9573   9868   9²674    …
 0.23    5910   7673   8907   9582   9871   9²683    …
 0.24    5948   7704   8925   9591   9875   9²693    …
 0.25    5987   7734   8944   9599   9878   9²702    …
 0.26    6026   7764   8962   9608   9881   9²711    …
 0.27    6064   7794   8980   9616   9884   9²720    …
 0.28    6103   7823   8997   9625   9887   9²728    …
 0.29    6141   7852   9015   9633   9890   9²736    …
 0.30    6179   7881   9032   9641   9893   9²744    …
 0.31    6217   7910   9049   9649   9896   9²752    …
 0.32    6255   7939   9066   9656   9898   9²760    …
 0.33    6293   7967   9082   9664   9901   9²767    …
 0.34    6331   7995   9099   9671   9904   9²774    …
 0.35    6368   8023   9115   9678   9906   9²781    …
 0.36    6406   8051   9131   9686   9909   9²788    …
 0.37    6443   8078   9147   9693   9911   9²795    …
 0.38    6480   8106   9162   9699   9913   9²801    …
 0.39    6517   8133   9177   9706   9916   9²807    …
 0.40    6554   8159   9192   9713   9918   9²813    …
 0.41    6591   8186   9207   9719   9920   9²819    …
 0.42    6628   8212   9222   9726   9922   9²825   9³69
 0.43    6664   8238   9236   9732   9925   9²831   9³70
 0.44    6700   8264   9251   9738   9927   9²836   9³71
 0.45    6736   8289   9265   9744   9929   9²841   9³72
 0.46    6772   8315   9279   9750   9931   9²846   9³73
 0.47    6808   8340   9292   9756   9932   9²851   9³74
 0.48    6844   8365   9306   9761   9934   9²856   9³75
 0.49    6879   8389   9319   9767   9936   9²861   9³76

Note. Decimal points in the body of the table are omitted. Repeated 9’s are indicated
by powers, e.g. 9³71 stands for 0.99971.
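
Entries of a normal distribution-function table of this kind can be spot-checked with the Python standard library. The sketch below is only an illustration (the helper name `normal_area` is hypothetical); it recomputes the area to the left of a deviate and, for the deviate 1.86, reproduces the value 0.9686.

```python
from statistics import NormalDist


def normal_area(deviate):
    """Area under the standard normal curve to the left of `deviate`."""
    return NormalDist().cdf(deviate)


# Worked example: the deviate 1.86 (= 1.5 + 0.36)
print(round(normal_area(1.86), 4))
```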
Appendix Table 4a Distribution function of $\chi^2$ for one degree of freedom for values
$\chi^2 = 0$ to $\chi^2 = 1$ by steps of 0.01

 χ²      P        Δ       χ²      P        Δ
0.00   1.00000   7966    0.50   0.47950   436
0.01   0.92034   3280    0.51   0.47514   430
0.02   0.88754   2505    0.52   0.47084   423
0.03   0.86249   2101    0.53   0.46661   418
0.04   0.84148   1842    0.54   0.46243   411
0.05   0.82306   1656    0.55   0.45832   406
0.06   0.80650   1516    0.56   0.45426   400
0.07   0.79134   1404    0.57   0.45026   395
0.08   0.77730   1312    0.58   0.44631   389
0.09   0.76418   1235    0.59   0.44242   384
0.10   0.75183   1169    0.60   0.43858   379
0.11   0.74014   1111    0.61   0.43479   374
0.12   0.72903   1060    0.62   0.43105   369
0.13   0.71843   1015    0.63   0.42736   365
0.14   0.70828    974    0.64   0.42371   360
0.15   0.69854    938    0.65   0.42011   355
0.16   0.68916    905    0.66   0.41656   351
0.17   0.68011    874    0.67   0.41305   346
0.18   0.67137    845    0.68   0.40959   343
0.19   0.66292    820    0.69   0.40616   338
0.20   0.65472    795    0.70   0.40278   334
0.21   0.64677    773    0.71   0.39944   330
0.22   0.63904    752    0.72   0.39614   326
0.23   0.63152    731    0.73   0.39288   322
0.24   0.62421    713    0.74   0.38966   318
0.25   0.61708    696    0.75   0.38648   315
0.26   0.61012    679    0.76   0.38333   311
0.27   0.60333    663    0.77   0.38022   308
0.28   0.59670    648    0.78   0.37714   304
0.29   0.59022    634    0.79   0.37410   301
0.30   0.58388    620    0.80   0.37109   297
0.31   0.57768    607    0.81   0.36812   294
0.32   0.57161    595    0.82   0.36518   291
0.33   0.56566    583    0.83   0.36227   287
0.34   0.55983    572    0.84   0.35940   285
0.35   0.55411    560    0.85   0.35655   281
0.36   0.54851    551    0.86   0.35374   278
0.37   0.54300    540    0.87   0.35096   276
0.38   0.53760    530    0.88   0.34820   272
0.39   0.53230    521    0.89   0.34548   270
0.40   0.52709    512    0.90   0.34278   267
0.41   0.52197    503    0.91   0.34011   264
0.42   0.51694    495    0.92   0.33747   261
0.43   0.51199    487    0.93   0.33486   258
0.44   0.50712    479    0.94   0.33228   256
0.45   0.50233    471    0.95   0.32972   253
0.46   0.49762    463    0.96   0.32719   251
0.47   0.49299    457    0.97   0.32468   248
0.48   0.48842    449    0.98   0.32220   246
0.49   0.48393    443    0.99   0.31974   243
0.50   0.47950           1.00   0.31731   241



Appendix Table 4b Distribution function of $\chi^2$ for one degree of freedom for values
of $\chi^2$ from 1 to 10 by steps of 0.1

 χ²     P        Δ       χ²     P        Δ
1.0   0.31731   2304    5.5   0.01902   106
1.1   0.29427   2095    5.6   0.01796    99
1.2   0.27332   1911    5.7   0.01697    94
1.3   0.25421   1749    5.8   0.01603    89
1.4   0.23672   1605    5.9   0.01514    83
1.5   0.22067   1477    6.0   0.01431    79
1.6   0.20590   1361    6.1   0.01352    74
1.7   0.19229   1258    6.2   0.01278    71
1.8   0.17971   1163    6.3   0.01207    66
Appendix Table 6 5 per cent. points of the distribution of z 
(values at which the d.f. = 0-95) 


(Reprinted from Table VI of Sir Ronald Fisher’s Statistical Methods for Research Workers, 
Oliver and Boyd Ltd., Edinburgh, by kind permission of the author and publishers) 
Appendix Table 7 5 per cent. points of the variance ratio F 
(values at which the d.f. = 0.95) 


(Reproduced from Sir Ronald Fisher and Dr F. Yates: Statistical Tables for Biological, 
Medical and Agricultural Research, Oliver and Boyd Ltd., Edinburgh, by kind permission 
of the authors and publishers) 


$\nu_1$ =    1      2      3      4      5      6      8      12     24     ∞
$\nu_2$ = 6  5.99   5.14   4.76   4.53   4.39   4.28   4.15   4.00   3.84   3.67

Lower 5 per cent. points are found by interchange of $\nu_1$ and $\nu_2$, i.e. $\nu_1$ must always correspond
to the greater mean square.
Appendix Table 8 1 per cent. points of the distribution of z 
(values at which the d.f. = 0.99) 


(Reprinted from Table VI of Sir Ronald Fisher’s Statistical Methods for Research Workers, 
Oliver and Boyd Ltd., Edinburgh, by kind permission of the author and publishers) 
Appendix Table 9 1 per cent. points of the variance ratio F 
(values at which the d.f. = 0.99) 


(Reproduced from Sir Ronald Fisher and Dr F. Yates: Statistical Tables for Biological, 
Medical and Agricultural Research, Oliver and Boyd Ltd., Edinburgh, by kind permission 
of the authors and publishers) 
Lower 1 per cent. points are found by interchange of $\nu_1$ and $\nu_2$, i.e. $\nu_1$ must always correspond
to the greater mean square.
Alternative hypothesis ($H_1$), see Hypotheses. 
(Exercise 33.3) 584. 
Ancillary statistics, 217. 
controlled variables, 409; tests of fit, 452; 
efficiency of SPR tests, 611. 

Andrews, F. C., ARE of Kruskal-Wallis test, 
504. 

Anscombe, F. J., ML estimation in negative 
binomial, (Exercises 18.26-7) 72-3; out- 
estimation, 616; sequential test in the ex-
622. 

Arbuthnot, J., early use of Sign test, 513 foot- 
note. 

ARE, see Asymptotic relative efficiency. 
578; $X^2$ dispersion tests for stratified popu-
lations, 580; sequential tests, 607, 614, 619. 

Armsen, P., tables of exact test of independence 
in 2 x 2 table, 553. 

Arnold, H. J., power of Wilcoxon symmetry 
test, 508. 
Askovitz, S. I., graphical fitting of regression 
line, (Exercise 28.19) 373. 

ASN, average sample number, 598-9. 

Aspin, A. A., tables for problem of two means, 
148. 

Association, in 2 x 2 tables, 536-41; and inde- 
pendence, 537 footnote; in rxc tables, 
561-74; partial, 541-5, 580-3; see Cate- 
gorized data. 


Asymptotic relative efficiency (ARE), 265-77; 
definition, 266; and derivatives of power 
function, 267—71; tests which cannot be 
compared by, 267, 269-70; and maximum 
power loss, 272-3; and estimating effi- 
ciency, 273-4; and correlation, 274, (Exer- 
cise 25.9) 277; non-normal cases, 274-5. 

Attributes, sampling of, see Binomial distribu- 
tion. 

Average sample number, 598-9. 


Bahadur, R. R., efficiency, 44; multiplicity of 
ML estimators, (Exercise 18.35) 74. 

BAN (Best asymptotically normal) estimators, 
91-5. 

Banerjee, K. S., estimation of linear relation, 
405. 

Barankin, E. W., bounds for variance, 17; 
sufficiency, 28; minimal sufficiency, 194 
footnote. 

Barnard, G. A., optimum LS properties, 79 
footnote; frequency justification of fiducial 
inference, 158; models for the 2 x 2 table, 
550; sequential methods, 592. 

Barnett, V. D., iterative ML estimation, 49, 
50; Cauchy location estimation, 50. 

Barr, D. R., testing non-regular distributions, 
240. 

Bartholomew, D. J., efficiency of k-sample test, 
506. 

Bartky, W., rectifying inspection, 607. 

Bartlett, M. S., approximate confidence inter- 
vals, 112, 128; conditional distribution in 
problem of two means, (Exercise 21.10) 
160; sufficiency and similar regions, 189; 
quasi-sufficiency, 217; approximations to 
distribution of LR statistic, 233; modifica- 
tion of a LR statistic, (Example 24.4) 235; 
conditional c.f., 319; adjustment in LS for 
extra observation, (Exercise 28.19) 373; 
estimation of functional relation, 404—5, 
(Exercise 29.11) 417; robustness, 465-7; 
interaction in multi-way tables, 583. 

Barton, D. E., “smooth” tests of fit, 446-50. 

Basu, A. P., outliers for exponential distribu- 
tions, 530. 
Basu, D., sufficiency, completeness and inde-
pendence, (Exercise 23.7) 219. 

Bateman, G. I., non-central normal variates 
subject to constraints, (Exercise 24.2) 257. 

Bayesian intervals, 150-2; critical discussion, 
152-4; and fiducial theory, 154-7; and 
confidence intervals, 157. 

BCR, best critical region, 165; see Tests of 
hypotheses. 

Beatty, G. H., tables of normal tolerance limits, 
130. 

Behrens, W. V., fiducial solution to problem of 
two means, 149. 

Bell, C. B., tests using random normal deviates, 
487. 

Bennett, B. M., test of independence in 2 x 2 
table, 553, 555. 

Benson, F., estimation using best two order- 
statistics, (Exercise 32.14) 533. 

Berger, A., comparisons of 2 x 2 tables, 555. 

Berkson, J., sampling experiments on BAN 
estimators, 95; choice of test size, 182; 
controlled variables, 408. 

Bernstein, S., characterization of bivariate nor- 
mality, 353 footnote. 

Best asymptotically normal (BAN) estimators, 
91-5. 

Bhattacharjee, G. P., non-normality and Stein’s 
test, 619. 

Bhattacharyya, A., lower bound for variance, 
12; covariance between MVB estimator 
and unbiassed estimators of its cumulants, 
(Exercise 17.4) 31-2; characterizations of 
bivariate normality, (Exercises 28.7—11) 
371. 

Bhuchongkul, S., tests of independence, 487. 

Bias in estimation, 4-5; corrections for, 5-7, 
(Exercises 17.13, 17.17-18) 33; see Un- 
biassed estimation. 

Bias in tests, 200; see Unbiassed tests, Tests of 
hypotheses. 

Bickel, P. J., robustness and efficiency of loca- 
tion estimators, 469, 525. 

Bienaymé—Tchebycheff inequality, and consis- 
tent estimation, (Example 17.2) 3. 

Binomial distribution, unbiassed estimation of 
square of parameter, 0, (Example 17.4) 6; 
MVB for 0, (Example 17.9) 11; estimation 
of 0 (1—@), (Example 17.11) 15; unbiassed 
estimation of functions of 6, (Exercise 
17.12) 33; estimation of linear relation be- 
tween functions of parameters of in- 
dependent binomials, (Exercise 17.25) 34; 
ML estimator biassed in sequential samp-
ling, (Exercise 18.18) 70; equivalence of 
ML and MCS estimators, (Exercise 19.15) 
97; confidence intervals, (Example 20.2) 
103, 118-20, (Exercise 20.8) 131; tables 
and charts of confidence intervals, 118; 
confidence intervals for the ratio of two 
binomial parameters, (Exercise 20.9) 131; 
prediction by Bayesian and confidence 
methods, 157; fiducial intervals, (Exercise 
21.3) 159; testing simple H, for 0, (Exer- 
cise 22.2) 184, 212; minimal sufficiency, 
(Exercise 23.11) 220; truncated, 527; 
homogeneity (dispersion) test, 578-80, 
(Exercises 33.21-2) 588-9; (inverse) se-
quential sampling, (Examples 34.1—7) 592— 
607, (Exercises 34.1-3) 620; MVB in se- 
quential estimation, (Example 34.11) 615; 
double sampling, (Exercise 34.17) 623; see 
Sign test. 

Birch, M. W., ML estimation of structural 
relation, 381, (Exercises 29.18) 418; limit- 
ing distribution of X?, 425; multi-way 
tables, 583. 

Birnbaum, A., Conditionality and Likelihood 
Principles, 217; logistic order statistics, 
(Exercise 32.12) 533; double sampling, 619. 

Birnbaum, Z. W., tabulation of Kolmogorov 
test, 457; one-sided test of fit, 458, 460; 
computation of Kolmogorov statistic, (Ex- 
ample 30.6) 460-1. 

Biserial correlation, 307-11, (Exercises 26.5, 
26.10-12) 313-14. 

Bivariate normal distribution, ML estimation 
of correlation parameter, p, alone, (Ex- 
ample 18.3) 38; indeterminate ML esti- 
mator of a function of p, (Example 18.4) 
42; asymptotic variance of p, (Example 
18.6) 44; ML estimation of various com- 
binations of parameters, (Example 18.14) 
57, (Example 18.15) 59, (Exercises 18.11- 
14) 69-70; charts of confidence intervals 
for p, 118; confidence intervals for ratio of 
variances, (Exercise 20.19) 133; power of 
tests for p, (Example 22.7) 172; joint c.f. 
of squares of variates, (Example 26.1) 283; 
linear regressions of squares of variates, 
(Example 26.3) 285, (Example 26.5) 286; 
estimation of p, 293-5; confidence inter- 
vals and tests for p, 295-6; tests of in- 
dependence and regression tests, 296; 
correlation ratios, (Example 26.8) 298; re- 
gressions, linear, homoscedastic, (Exercise 
26.2) 312, (Example 28.1) 349; estimation 
of p when parent dichotomized, (Exercises 
26.3-4) 313; distribution of regression 
coefficient, (Exercise 26.9) 313; ML estima- 
tion in biserial situation, (Exercises 26.10— 
12) 313-14; LR test for p, (Exercise 26.15) 
315; ML estimation and LR tests for com- 
mon correlation parameter of two distribu- 
tions, (Exercises 26.19-22) 315-16; joint 
distribution of sums of squares, (Example 
28.2) 349; characterizations, 352-3, (Exer- 
cises 28.7—-11) 371; robustness of tests for 
p, 468-9; efficiencies of tests of indepen- 
dence, 473, 481-2; efficiencies of tests of 
regression, 484-7; UMPU tests of p = 0, 
(Exercise 31.21) 512; interpretation of 
contingency coefficient, 557; transforma- 
tions and canonical correlations, 568-9. 

Blackwell, D., sufficiency and MV estimation, 
25; sequential sampling, 603; decision 
functions, 620. 

Blalock, H. M., Jr., interpretation of $X^2$ 
measures, 561. 

Bliss, C. I., outlying observations, 529. 

Bloch, D., Cauchy location estimation, 50. 

Blom, G., methods in logistic distribution, 
(Exercise 18.29) 73; simple estimation in 
censored samples, 524; asymptotic MVB, 
(Exercises 32.8-11) 531-2. 

Blomqvist, N., ARE of tests, 268; medial cor-
relation test, (Exercise 31.7) 510. 

Blum, J. R., distribution of Hoeffding’s test, 
483. 

Blyth, C. R., tables of binomial and Poisson 
confidence intervals, 120. 

Bose, R. C., simultaneous confidence intervals, 
128. 

Bounded completeness, see Completeness. 

Bowker, A. H., tables of tolerance limits, 130; 
approximations for tolerance limits, (Exercises 20.20-2) 133; 
test of complete symmetry in r x r table, (Exercise 33.23) 589. 

Bowman, K., cumulants of ML estimators, 48. 

Box, G. E. P., approximations to distribution of 
LR statistic, 234; robustness, 465, 468. 

Brandner, F. A., ML estimation and LR tests 
for bivariate normal correlation para- 
meters, (Exercises 26.19-—22) 315-16. 

Brillinger, D. R., removal of ML bias, 42; 
fiducial paradoxes, 155; maximal corre- 
lation with order-statistics scores, 487. 

Bross, I. D. J., charts for test of independence 
in 2 x 2 table, 553. 
Brown, G. W., median test of randomness, 
(Exercise 31.12) 511. 

Brown, L., sufficient statistics, 26. 

Brown, M. B., problem of two means, 147. 

Brown, R. L., data on functional relation, 
(Example 29.1) 388; confidence intervals 
for functional relations, 391; linear func- 
tional relations in several variables, 394, 
(Exercises 29.6-8) 416-17. 

Buehler, R. J., fiducial inference, 155. 

Bulmer, M. G., confidence limits for distance 
parameter, 561. 

Burman, J. P., sequential sampling, 605. 

Byars, B. J., tables of non-central $t$, 255. 


$c_1$ (two-sample) test, 498-503. 

Canonical analysis of r x c tables, 568-75. 

Capon, J., scale-shift tests, 503; locally most 
powerful rank tests, (Exercise 31.17) 511. 

Categorized data, 536-91; measures of associa- 
tion, 536-41, 556-9, 561, (Exercises 33.1— 
4, 33.6) 584-5; partial association, 541-5; 
probabilistic interpretations, 545-7; large- 
sample tests of independence, 547-9, 559- 
561; exact test of independence, different 
models, 549-55, 559; continuity correc- 
tion, 555-6, (Exercise 33.5) 584; ordered 
tables, rank measures, 562-6, (Exercises 
33.10-11) 586; ordered tables, scoring 
methods and canonical analysis, 566-74, 
(Exercises 33.12—-14) 586-7; use of LS, 
568; partitions of $X^2$, canonical components, 
574-8, (Exercise 33.13) 586; binomial 
and Poisson dispersion tests, 579-80; 
multi-way tables, 580-3; test of complete 
symmetry, (Exercise 33.25) 589; test of 
identical margins, (Exercise 33.24) 589; 
ML estimation of cell probabilities, (Exer- 
cise 33.29) 590; test of identical margins in 
2° table, (Exercise 33.30) 590. 

Cauchy distribution, uselessness of sample 
mean as estimator of median, 2-3; sample 
median consistent, (Example 17.5) 8; 
MVB for location, (Example 17.7) 11; ML 
and order-statistics estimators of location, 
(Example 18.9) 49; testing simple H, for 
median, (Example 22.4) 168, (Exercise 
22.4) 184; completeness, 190, (Exercise 
23.9) 219. 

Censored samples, 522-7; see also under names 
of distributions. 

Central and non-central confidence intervals, 
102. 
Centre of location, 64. 

Chanda, K. C., Wilcoxon test in discrete cases, 
498. 

Chandler, K. N., relations between residuals, 
(Exercise 27.5) 343. 

Chandra Sekar, C., outlying observations, 529. 

Chapman, D. G., bound for variance of esti- 
mator, 17; estimation of normal standard 
deviation, (Exercise 17.6) 32; truncated 
Gamma distribution, 527; double sampling, 619. 

Characteristic functions, and completeness, 190; 
conditional, 319-21. 

Chernoff, H., asymptotic expansions in prob- 
lem of two means, 148; reduction of size 
of critical regions, 177 footnote; measure 
of test efficiency, 276, (Exercises 25.5-6) 
277; distribution of $X^2$ test of fit, 428, 430; 
ARE of Fisher—Yates test, 498, 501; effi- 
cient estimation from order-statistics, 524. 

Chipman, J. S., singular LS estimation, 86. 

Chow, Y. S., moments in sequential estima- 
tion, 615. 

Clark, R. E., tables of confidence limits for the 
binomial parameter, 118. 

Clark, V. A., logistic order-statistics, (Exercise 
32.12) 530. 

Clitic curve, 347. 

Clopper, C. J., charts of confidence intervals 
for the binomial parameter, 118. 

Closeness in estimation, 8. 

Clunies-Ross, C. W., two-stage LS, (Exercise 
28.24) 374. 

Cochran, W. G., Sign test, (Exercise 25.1) 276, 
513-14; adjustment of regression analysis, 
356; critical region for $X^2$ test, 422; choice 
of classes for $X^2$ test, 440; limiting power 
function of $X^2$ test, (Exercise 30.4) 462; 
outlying observations, 529; $X^2$ in 2 x 2 
tables, 556; partitions of $X^2$, 578; dispersion 
tests, 579; test for partial association, 583; 
test of identical margins in 2° table, (Exer- 
cise 30.30) 590. 

Cohen, A. C., Jr., truncation and censoring in 
normal distributions, 525; truncation and 
censoring in Poisson distributions, 526, 
(Exercise 32.22) 535; truncated Gamma 
distribution, 527. 

Colligation coefficient, 539. 

Combination of tests, (Exercise 30.9) 463. 

Completeness, 190; and unique estimation, 
190; of sufficient statistics, 190-3; and 
similar regions, 196; in non-parametric 
problems, 471-2; sequential, 614 foot-
note. 

Complex contingency tables, see Multi-way 
tables. 

Components of X², 449-50; in r x c tables,
574-8. 

Composite hypotheses, 186-223; see Hypo- 
theses, Tests of hypotheses. 

Conditional tests, 217-19. 

Conditionality Principle, 217. 

Confidence belt, coefficient, 100; see Con- 
fidence intervals. 

Confidence distribution, 102, 155. 

Confidence intervals, generally, 98-133; dis- 
tributions with one parameter, 98-120; 
limits, 99; belt, coefficient, 100; graphical 
representation, 101, 105, 119, 121, 125; 
central and non-central, 102; discon- 
tinuous distributions, 103-5, 118-20; con- 
servative, 103-5; for large samples, 105-8, 
112; nestedness, 107-10; difficult cases, 
108-12, (Example 20.7) 125-7, (Exercise 
28.21) 373; shortest, 114-18; minimum 
average length in large samples, 115-17; 
most selective, 117, 123, 206; and tests, 
117-18, 206; unbiassed, 118, 206; tables 
and charts, 118; distributions with several 
parameters, 120-3; choice of statistic, 123; 
studentized, 123-7; simultaneous intervals 
for several parameters, 127-8; when range 
depends on parameter, (Exercises 20.13- 
16) 132; problem of two means, 139-48, 
(Exercise 21.9) 159; critical discussion, 
152-4, 158; for a continuous d.f., 457-8; 
for a shift in location (non-parametric), 
491, (Exercises 31.24-5) 512; for quantiles,
517.

Confidence limits, 99; see Confidence intervals. 

Confidence regions, 127-8, (Exercise 20.5) 131; 
for a regression, 365-70. 

Connell, T. L., double sampling for normal 
variance, 619. 

Consistency in estimation, 3-4, 92 footnote, 
262; of ML estimators, 39-42, 55, (Ex- 
ample 18.16) 61, (Exercise 18.35) 74. 

Consistency in tests, 240; of LR tests, 241. 

Constraints, imposed by a hypothesis, 163, 
186. 

Contingency, tables, 556; coefficient, 557; see 
Categorized data. 

Continuity corrections, in distribution-free 
tests, 508-9; in 2 x 2 tables, 555-6. 

Controlled variables, 408-9, 413. 




Convergence in probability, 3; see Consistency 
in estimation. 

Cornfield, J., fiducial inference, 157; tables of 
studentized maximum absolute deviate, 
529. 

Corrections, for bias in estimation, 5-7, (Exercises
17.17-19) 33; to ML estimators for
grouping, (Exercises 18.24-5) 71-2.

Correlation, between estimators, 18-19; gener- 
ally, 278-345; and interdependence, 278- 
279, 288; and causation, 279-80; coeffi- 
cient, 287-8; computation of coefficient, 
(Examples 26.6-7) 289-92; scatter diagram,
292; standard errors, 292; estimation and 
testing in normal samples, 293-6; ratios, 
296; and linearity of regression, 296-8, 
(Exercise 26.24) 316; computation of ratio, 
(Example 26.9) 298; LR tests of coefficient, 
ratio and linearity of regression, 299-301; 
intra-class, 302-4; tetrachoric, 304-7; bi- 
serial, 307-11; point-biserial, 311-12; co- 
efficient increased by Sheppard’s group- 
ing corrections, (Exercise 26.16) 315; 
attenuation, (Exercise 26.17) 315; spurious,
(Exercise 26.18) 315; matrix, 317; deter- 
minant, 318; see also Multiple Correlation, 
Partial correlation, Regression. 

Covariance, 283; see Correlation, Regression. 

Cox, C. P., orthogonal polynomials, 360. 

Cox, D. R., estimation of linear relation be- 
tween functions of parameters of inde- 
pendent binomials, (Exercise 17.25) 34; 
confidence distribution, 102; conditional 
tests, 218; LR statistics, 247; ARE and 
maximum power loss, 273; regression of 
efficient on inefficient test statistic, (Exer- 
cise 25.7) 277; exponential scores, 502; 
ARE of simple tests of randomness, (Exercises
31.10-12) 510-11; sequential procedures
for composite H₀, 614, (Exercise
34.21) 624; double sampling, 619, (Exer- 
cises 34.15-17) 623. 

Craig, A. T., completeness of sufficient statistics,
191; completeness and independence, 
(Exercises 23.8-9) 220; LR test for rect- 
angular distribution, (Exercise 24.9) 258- 
259; sufficiency in exponential family with 
range a function of parameters, (Exercise 
24.17) 260. 

Cramér, H., MVB, 9; efficiency and asymptotic
normality of ML estimators, 43; distribution
of X² test of fit, 425; test of fit, 451;
coefficient of association, 557.

Cramér-Rao inequality, see Minimum Variance
Bound.

Creasy, M. A., confidence intervals for linear 
functional relation, 389. 

Crime in cities, Ogburn’s data, (Example 27.2) 
331-2. 

Critical region, see Tests of hypotheses. 

Cross-ratio, in 2 x 2 table, 547, 555. 

Crouse, C. F., Mood’s test, 503. 

Crow, E. L., tables of confidence limits for 
proportions and Poisson parameters, 118. 


Daly, J. F., smallest confidence regions, 128; 
unbiassed LR tests for independence, 245. 

Daniels, H. E., asymptotic normality and 
efficiency of ML estimators, 44; non- 
unique ML estimators, (Exercise 18.33) 
74; minimization of generalized variance by 
LS estimators, 81; estimation of a function 
of the normal correlation parameter, 294; 
coefficients of correlation and disarray, 
477, 563; joint distribution of rank corre- 
lations, 481, (Exercise 31.6) 510; theorem 
on correlations, (Exercise 31.3) 509. 

Dantzig, G. B., no test of normal mean with 
power independent of unknown variance, 
197. 

Dar, S. N., ratio of non-central F variables, 254. 

Darling, D. A., tests of fit, 452, 460; distribu- 
tion of ratio of sum to extreme value, 530. 

Darmois, G., sufficiency, 26, 28. 

Darroch, J. N., interactions in multi-way tables, 
583. 

David, F. N., modified MCS estimators, (Exercise
19.14) 97; charts for the bivariate normal
correlation parameter, 118, 295; bias in
testing normal correlation parameter, 295;
runs test to supplement X² test of fit, 442,
(Exercises 30.7, 30.9) 462-3; probability 
integral transformation with parameters 
estimated, 442, (Exercise 30.10) 463; 
“smooth” tests of fit, 446; bibliography
of order-statistics, 513. 

David, H. A., tables of studentized deviates 
and of studentized range, 529. 

David, S. T., difficult confidence intervals, 109;
joint distribution of rank correlations, 481. 

Davis, R. C., sufficiency with terminals depen- 
dent on the parameter, 29. 

Decision functions, 620. 

Deemer, W. L., Jr., truncation and censoring 
for the exponential distribution, 526, 
(Exercise 32.16) 533.

Degrees of freedom, of a hypothesis, 163,
186.

De Groot, M. H., sequential sampling for 
attributes, 614 footnote. 

Dempster, A. P., fiducial inference, 154, 155. 

Den Broeder, G. G., Jr., truncated and cen- 
sored Gamma variates, 527. 

Dependence and interdependence, 278-9. 

Disarray, 477. 

Discontinuities, and confidence intervals, 103-105,
118-20; and tests, 166-7; effect on
distribution-free tests, 508-9; correction in 
2 x 2 tables, 555-6. 

Dispersion tests, binomial and Poisson, 578-580.

Distribution-free procedures, tests of fit, 443-4, 
451-2; confidence intervals for a con- 
tinuous d.f., 457; in general, 469; and non- 
parametric problems, 470; classification of, 
470-1; construction of tests, 471-2; efficiency
of tests, 472-3; tests of independence,
473-83, 486-7; tests of randomness, 483-7;
two-sample tests, 487-503; confidence 
intervals for a location shift, 491, (Exercise 
31.17) 511; k-sample tests, 503-6; tests of 
symmetry, 506-8; tests for quantiles, 513-17;
confidence intervals for quantiles, 517-18;
tolerance intervals, 518-21; tests for
outliers, 530-1; categorized data, 536-91; 
sequential tests, 619-20. 

Distribution function, sample, 450-1; con- 
fidence limits for, 457-8. 

Dixon, W. J., efficiency of tests, 265; power of 
Wilcoxon test, 498; Sign test, 514-16; 
censored normal samples, 525; outlying 
observations, 528-9. 

Dodge, H. F., double sampling, 607. 

Doksum, K. A., tests using random normal 
deviates, 487. 

Dorff, M., estimation of functional and struc- 
tural relations, 383, 408. 

Double exponential (Laplace) distribution, 
ML estimation of mean, (Example 18.7) 45,
(Exercise 18.1) 67; testing against normal 
form, (Example 22.5) 169; Sign test 
asymptotically most powerful for location, 
(Exercise 32.5) 531. 

Double sampling, 607, 618-19, (Exercises 
34.15-17) 623. 

Downton, F., ordered LS estimation, 87; 
extreme-value distribution, 527. 

Dudman, J., logistic order-statistics, (Exercise 
32.12) 533.

Duncan, A. J., charts of power function for
linear hypothesis, 253.

Durbin, J., generalization of MVB, (Exercise 
18.32) 74; geometry of LS theory, 80; 
supplementary information in regression, 
(Exercises 28.17-18) 372; instrumental 
variables, 399, (Exercise 29.17) 418; tests 
of fit and distribution of rectangular order- 
statistics, (Exercise 30.17) 464; removing 
nuisance parameter by discarding sufficient 
statistic, (Exercise 30.18) 464; test of 
diagonal proportionality in r x r table,
(Exercise 33.25) 589. 

Dynkin, E. B., minimal sufficiency, 194.


East, D. A., tables of confidence intervals for a 
normal variance, 118. 

Edwards, A. W. F., association and cross-ratio, 
547. 

Efficiency in estimation, and correlation between
estimators, 18-19; definition, 19;
measurement, 19-21; partition of error in 
inefficient estimator, (Exercise 17.11) 32; 
of ML estimators, 42-6, 55-6; of method 
of moments, 65-7, (Exercises 18.9, 18.17, 
18.20, 18.27-8) 69-73; and power of tests, 
171-2, (Exercise 22.7) 184; and ARE of 
tests, 273-4. 

Efficiency of tests, 262-5; see Asymptotic Rela- 
tive Efficiency. 

Eisenhart, C., limiting power function of X²
test of fit, 436; tables of runs test, (Exercise 
30.8) 463. 

El-Badry, M. A., ML estimation of cell prob- 
abilities in r x c table, (Exercise 33.29) 590. 

Ellison, B. E., tolerance intervals, 130. 

Epstein, B., an independence property of the 
exponential distribution, (Exercise 23.10) 
220; censoring in exponential samples,
526-7, (Exercise 32.17) 534. 

Errors, in LS model, 76; from regression, 322; 
identical, 351; of observation and func- 
tional relations, 375-7; of observation in 
regression, 414-16. 

Estimates, and estimators, 2. 

Estimation, point, 1-97; interval, 98-160; and 
completeness, 190; see Efficiency in esti- 
mation. 

Estimators, and estimates, 2. 

Exhaustive, 193 footnote. 

Exponential distribution, sufficiency of smallest 
observation for lower terminal, (Example 
17.19) 30; sufficiency when upper terminal
a function of scale parameter, (Example
17.20) 30; MV unbiassed estimation,
(Exercise 17.24) 34; ML estimation and 
sufficiency when both terminals depend on 
parameters, (Exercise 18.5) 68; grouping 
correction to ML estimator of scale para- 
meter, (Exercise 18.25) 72; ordered LS 
estimation of location and scale para- 
meters, (Exercises 19.11-12) 96-7; con- 
fidence limits for location parameter, 
(Exercise 20.16) 132; UMP tests for loca- 
tion parameter, (Example 22.6) 170; satisfies 
condition for two-sided BCR, 174; UMP 
test without single sufficient statistic, (Ex- 
amples 22.9-10) 176-8, (Exercise 22.11) 
185; UMP similar one-sided tests of composite
H₀ for scale parameter, (Example
23.9) 198; independence of two statistics, 
(Exercise 23.10) 220; non-completeness 
and similar regions, (Exercise 23.15) 221; 
UMPU test of composite H₀ for scale
parameter, (Exercise 23.24) 222; UMP
similar test of composite H₀ for location
parameter, (Exercise 23.25) 222; LR tests, 
245, (Exercises 24.11-13, 24.16, 24.18) 
259-60; test of fit on random deviates, 
(Examples 30.2-4) 432-9, (Exercise 30.6) 
462; use of scores, 502; truncation and 
censoring, 526, (Exercises 32.16-18) 533-4;
inverse sampling, 595; sequential test for 
scale parameter, (Exercise 34.14) 622. 

Exponential family of distributions, 12; attain- 
ability of MVB, 15; as characteristic form 
of distribution admitting sufficient statis- 
tics, 26, 28; sufficient statistics distributed 
in same form, (Exercise 17.14) 33; and completeness,
190-1; UMPU tests for, 207-17;
with range dependent on parameters, 
(Exercise 24.17) 260. 

Extreme-value distribution, ML estimation in, 
(Exercise 18.6) 68; censored data, 527. 

Ezekiel, M. J. B., confidence intervals for 
multinormal multiple correlation, 342. 


F distribution, non-central, 252-4. 

Fechner, G. T., work on rank correlation, 478.

Feller, W., similar regions, 188-9, (Exercises 
23.1-2) 219; distribution of Kolmogorov 
test statistic, 452. 

Fereday, F., linear functional relations in 
several variables, 394, (Exercises 29.6-8)
416-17. 

Ferguson, T. S., rejection of outliers, 528-9.

Féron, R., characterization of bivariate normality,
353 footnote.

Fiducial inference, generally, 134-8; in “Student’s”
distribution, 138-9, (Exercise
21.7) 159; in problem of two means, 148-150;
critical discussion, 152-8; paradoxes,
154-5; concordance with Bayesian infer- 
ence, 155-7. 

Fiducial intervals, probability, see Fiducial in- 
ference. 

Fieller, E. C., difficult confidence intervals, 109, 
126. 

Finney, D. J., ML estimation in lognormal dis- 
tribution, (Exercises 18.7-9, 18.19-20) 
68-71; truncated binomial distribution, 
527; tables of exact test of independence in 
2 x 2 table, 553, 555; inverse sampling, 595. 

Fisher, F. M., Cauchy location estimation, 50. 

Fisher, R. A. (Sir Ronald), definition of LF, 
8 footnote; sufficiency, 22; partition of 
error in inefficient estimator, (Exercise 
17.11) 32; ML principle, 35; successive 
approximation to ML estimator, 49, (Ex- 
ample 18.10) 50; use of LF, 62; ML 
estimation of location and scale parameters, 
62; efficiency of method of moments, 66, 
(Exercise 18.17) 70, (Exercise 18.27) 72; 
centre of location of Type IV distribution,
(Exercise 18.16) 70; consistency, 92 foot- 
note, 262; fiducial inference, 134; fiducial 
solution to problem of two means, 149; 
fiducial paradoxes, 155; fiducial intervals for
future observations, (Exercise 21.7) 159; exhaustiveness
and sufficiency, 193 footnote;
ancillary statistics, 217; non-central χ² and
F distributions, 229, 253; transformation 
of correlation coefficient, 295; tests of cor- 
relation and regression, 299; distribution 
of intra-class correlation, 304, (Exercise 
26.14) 315; distribution of partial corre- 
lations, 333; distributions of multiple 
correlations, 338-9, (Exercises 27.13-16) 
344; orthogonal polynomials, 359-60; 
test of difference between regressions, 
(Exercise 28.15) 371; distribution of X²
test of fit, 424, 428, (Exercise 30.1) 462; 
tests using normal scores, 487, 498; test of 
symmetry, 508; exact treatment of 2 x 2 
table, 550, 552; dispersion tests, 579. 

Fit, see Tests of fit. 

Fix, E., tables of non-central χ², 231; linearity
of regression, 416; tables of Wilcoxon test,
494.

Fourfold table, see 2 x 2 table.

Fourgeaud, C., characterization of bivariate 
normality, 353 footnote. 

Fox, K. A., confidence intervals for multi- 
normal multiple correlation, 342. 

Fox, L., matrix inversion, 78. 

Fox, M., charts of power function for linear 
hypothesis, 253. 

Fraser, D. A. S., tolerance intervals for a normal
distribution, 130; fiducial inference,
155, 157, 158; minimal sufficiency, 194
footnote; runs test to supplement X² test
of fit, 442, (Exercise 30.7) 462; rank tests, 
487; tolerance regions, 521. 

Fréchet, M., linear regressions and correlation 
ratios, (Exercise 26.24) 316. 

Freeman, G. H., exact test of independence in 
r x c tables, 559.

Freund, R. J., two-stage LS, (Exercise 28.24) 
374. 

Functional and structural relations, 278; gener- 
ally, 375-418; notation, 375-6; linear, 
376-7; and regression, 376-80; ML esti- 
mation, 379-88, 409; geometrical inter- 
pretation, 385, 410; confidence intervals 
and tests, 389-92; several variables, 392-4; 
use of product-cumulants, 395-8, 411-12; 
instrumental variables, 398-408, 412; use 
of ranks, 406-8; controlled variables, 408- 
409, 413; curvilinear relations, 410-13. 


Gabriel, K. R., multiple comparisons in r x c 
tables, 561. 

Gafarian, A. V., confidence regions for regres- 
sion lines, 370. 

Galton, F., data on eye-colour, (Example 33.2) 
541. 

Gambler’s ruin, (Example 34.2) 595. 

Gamma distribution, MVB estimation of scale 
parameter, (Exercise 17.1) 31; sufficiency 
properties, (Exercise 17.9) 32; estimation 
of lower terminal, (Exercise 17.22) 34; 
ML estimation of location and scale para- 
meters, (Example 18.17) 64; efficiency of 
method of moments, (Example 18.18) 66; 
confidence intervals for scale parameter, 
(Exercise 20.1) 130; fiducial intervals for 
scale parameter, (Example 21.2) 137; dis- 
tribution of linear function of Gamma vari- 
ates, (Exercise 21.8) 159; non-existence of 
similar regions, (Exercise 23.2) 219; com- 
pleteness and a characterization, (Exercise 
23.27) 223; connexions with rectangular
distribution, 236-7; ARE of Wilcoxon
test, (Exercise 31.18) 512; truncation and 
censoring, 527. 

Gardner, R. S., tables of limits for a Poisson 
parameter, 118. 

Garwood, F., tables of confidence intervals for 
the Poisson parameter, 118. 

Gastwirth, J. L., robustness, 469; efficient 
estimation from order-statistics, 524; MV 
estimators and rank tests, 524; rank tests 
for censored samples, 527. 

Gauss, C., originator of LS theory, 79, 82. 

Gayen, A. K., studies of robustness, 465-8, 504. 

Geary, R. C., “close” estimators, 8; asymptotic
minimization of generalized variance by 
ML estimators, 56; functional and struc- 
tural relations, 395, 410, 413, (Exercises 
29.2, 29.15) 416, 418; testing normality, 
461; robustness, 465-7. 

Gehan, E. A., Wilcoxon test for censored 
samples, 527. 

Geisser, S., fiducial inference, 157. 

General linear hypothesis, see Linear model. 

Generalized variance, 56; minimized asymp- 
totically by ML estimators, 56; minimized 
in linear model by LS estimators, 81-2. 

Ghosh, J. K., completeness, 190; invariance 
and sufficiency, 256, 614. 

Gibbons, J. D., non-normality and Sign test, 
515. 

Gibson, W. M., estimation of functional re- 
lation, 405. 

Gilby, W. H., data on clothing and intelligence, 
(Example 33.7) 558. 

Girshick, M. A., decision functions, 620; 
sequential analysis, (Exercises 34.8, 34.9, 
34.20) 621-4. 

Glass, D. V., Durbin’s test of diagonal propor- 
tionality in r x r table, (Exercise 33.25) 589. 

Gnanadesikan, M., logistic estimators, (Exercise 
18.29) 73; censored Gamma samples, 527. 

Godambe, V. P., generalization of MVB, 
(Exercise 18.32) 74. 

Goen, R. L., truncated Poisson distribution, 
526, (Exercise 32.19) 534. 

Goheen, H. W., tetrachoric correlation, 307. 

Goldberger, A. S., two-stage LS, (Exercise
28.24) 374. 

Goldman, A. S., double sampling, 619. 

Goodman, L. A., tolerance intervals, 521; 
association in categorized data, 545, 561, 
565-6, (Exercise 33.11) 586; 2 x 2 tables,
555; interactions in multi-way tables, 583.

Goodness-of-fit, 419; see Tests of fit.

Govindarajulu, Z., power of Wilcoxon test, 
498; bibliography of life-testing, 526; 
censored double exponential, 527. 

Gram-Charlier Series, Type A, efficiency of 
method of moments, (Exercise 18.28) 73. 

Graybill, F. A., quadratic forms in non- 
central normal variates, 229; double samp- 
ling for normal variance, 619. 

Greenberg, B. G., order-statistics, 513; cen- 
sored samples, 525-7, (Exercise 32.25) 535. 

Greenhouse, S. W., tables of studentized maxi- 
mum absolute deviate, 529. 

Greenwood, M., data on inoculation, (Example 
33.1) 538. 

Grouping, corrections to ML estimators, 
(Exercises 18.24-5) 72; for instrumental
variables, 399-405, 412-13; truncated and 
censored data, 525. 

Grubbs, F. E., tests for outliers, 529. 

Grundy, P. M., grouped truncated and cen- 
sored normal data, 525. 

Guenther, W. C., unbiassed tests for a normal 
variance, 213. 

Guest, P. G., orthogonal polynomials, 360. 

Guilford, J. P., tetrachoric correlation, 307. 

Gumbel, E. J., choice of classes for X² test of
fit, 432. 

Gupta, S. S., logistic estimators, (Exercise 
18.29) 73; logistic tables, (Exercise 32.12) 
533. 

Gurland, J., estimation in negative binomial 
and Neyman Type A distributions, (Exer- 
cise 18.28) 73; estimation of functional and 
structural relations, 383, 408. 

Guttman, I., tolerance intervals, 130. 


H₀, hypothesis tested, null hypothesis, 163.

H₁, alternative hypothesis, 163.

Hader, R. J., censored exponential samples, 
526. 

Hajek, J., problem of two means, 147. 

Hajnal, J., sequential two-sample t-test, 614. 

Hald, A., truncated normal distribution, 525. 

Haldane, J. B. S., cumulants of a ML estima- 
tor, 46; mean of modified MCS estimator, 
(Exercise 19.14) 97; standard error of X²
in r x c tables, 560, (Exercise 33.9) 585;
approximation to X² for 2 x c table,
(Exercises 33.21-2) 588-9; inverse sampling,
595.

Hall, W. J., invariance and sufficiency, 256, 614. 

Halmos, P. R., sufficiency, 24.

Halperin, M., confidence intervals for non-linear
functions, 120, (Exercise 28.3) 370;
linear functional relation, 405; ML estima- 
tion in censored samples, 524, (Exercise 
32.15) 533; truncated normal distribution, 
525; confidence intervals for censored 
samples, 527; Wilcoxon test for censored 
samples, 527; tables of studentized maxi- 
mum absolute deviate, 529. 

Halton, J. H., exact test of independence in 
r x c tables, 559.

Hamdan, M. A., choice of classes for X² test,
439; polychoric estimation, 561. 

Hamilton, M., nomogram for tetrachoric cor- 
relation, 307. 

Hamilton, P. A., tables of confidence intervals 
for a normal variance, 118. 

Hammersley, J. M., estimation of restricted 
parameters, (Exercises 18.21-2) 71.

Hannan, E. J., ARE for non-central χ² variates,
275; regressors, 346. 

Hannan, J. F., multinomial biserial methods, 
(Exercise 26.12) 312; power of test in 2 x 2 
tables, 555. 

Harkness, W. L., power of test in 2 x 2 tables, 
555. 

Harley, B. I., estimation of a function of normal 
correlation parameter, 294. 

Harter, H. L., shortest confidence intervals, 
117; ratio of normal ranges, (Exercise 
23.14) 221; tables of normal scores, 502; 
censored normal samples, 525; censored 
exponential samples, 526; censored log- 
normal samples, 527; tables of studentized 
range, 529. 

Hartley, H. O., charts of power function for 
linear hypothesis, 253; confidence regions 
for non-linear regression, (Exercise 28.3) 
370; iterative solution of ML equations for 
incomplete data, 524; tables of studentized 
range, 529. 

Hayes, J. G., matrix inversion, 78. 

Hayes, S. P., Jr., nomogram and tables for 
tetrachoric correlation, 307. 

Haynam, G. E., power of Wilcoxon test, 498. 

Healy, M. J. R., comparison of predictors,
(Exercise 28.22) 374. 

Healy, W. C., Jr., double sampling, 619.

Hemelrijk, J., power of Sign test, 514 footnote. 

Hill, B. M., estimating proportions in mixtures 
of distributions, (Exercise 17.2) 31; ML 
estimation in lognormal distribution, 
(Exercise 18.23) 71-2.

Hodges, J. L., Jr., efficiency of tests, 265; tables
of Wilcoxon test, 494; ARE of Wilcoxon 
test, 497; power of Wilcoxon test, 498; 
robust estimation, 502; Wilcoxon and 
normal scores tests ARE, (Exercise 31.23) 
512; minimum ARE of Sign test, (Exercise 
32.1) 531.

Hoeffding, W., efficiency of LR and X² tests in
multinomial samples, 276, 440; permutation
and normal-theory distributions, 475;
joint distribution of rank correlations, 481; 
test of independence, 483; optimum pro- 
perties of tests using normal scores, 487, 
498; asymptotic distribution of expected 
values of order-statistics, 487; distribution 
of Wilcoxon test, 494. 

Hoel, P. G., confidence regions for regression 
lines, 365, 368, 370, (Exercise 28.13) 371. 

Hogben, D., moments of non-central t, 255. 

Hogg, R. V., completeness of sufficient statis- 
tics, 191; completeness, sufficiency and 
independence, (Exercises 23.8-9) 220; tests
in k samples, (Exercise 24.6) 258, (Exercise 
24.13) 261, 506; LR tests when range 
depends on parameter, 236, (Exercises 
24.8-9) 258-9; sufficiency in exponential 
family when range a function of para- 
meters, (Exercise 24.17) 260. 

Homogeneity tests, binomial and Poisson, 578-580.

Hooker, R. H., data on weather and crops, 
(Example 27.1) 329. 

Hora, R. B., fiducial inference, 155. 

Hotelling, H., estimation of functions of normal 
correlation parameter, 294; variance of 
multiple correlation coefficient, 342; con- 
fidence region for regression line, 365, 
368-9, (Exercise 28.13) 371; comparison 
of predictors, (Exercise 28.22) 374; robust- 
ness, 469; efficiency of rank correlation 
test, 482. 

Hoyland, A., robust estimation, 502. 

Hsu, P., test of independence in 2 x 2 tables,
503, 555.

Hsu, P. L., optimum property of LR test for 
linear hypothesis, 256. 

Huber, P. J., robustness of location estimators, 
469. 

Hudson, D. J., fitting segmented curves by LS, 
356. 

Hutchinson, D. W., tables of binomial and 
Poisson confidence intervals, 120. 

Huyett, M. J., censored Gamma samples, 527.

Huzurbazar, V. S., uniqueness of ML estimators,
36-7, 52-3; consistency of ML
estimators, 41; confidence intervals when 
range depends on parameter, (Exercises 
20.13-16) 132. 

Hypergeometric distribution, inverse sampling, 
595. 

Hypotheses, statistical, 161, 186; parametric 
and non-parametric, 162; simple and com- 
posite, 162-3, 186; degrees of freedom, 
constraints, 163; critical regions and alter- 
native hypotheses, 163-4; null hypothesis, 
163 footnote; see Tests of hypotheses. 

Hyrenius, H., “Student’s” t in compound
samples, 467. 


Identical categorizations, 536. 

Identical errors, 351. 

Identifiability, (Example 18.4) 42; in structural
relations, 379-82, 391, 394. 

Incidental parameters, 383. 

Independence, proofs using sufficiency and 
completeness, (Exercises 23.6-10) 219-20;
and correlation, 278-9; tests of, 473-83; 
and association, 537 footnote; frequencies, 
538-9. 

Information, amount of, 10; matrix, 28. 

Instrumental variables, 398-408, 412-13. 

Interactions, in multi-way tables, 583. 

Interdependence, 278-9. 

Intersection, of polynomial regressions, 365. 

Interval estimation, 98-160. 

Intra-class correlation, 302-4, (Exercise 26.14) 
315; in multinormal distribution, (Exer- 
cise 27.17) 345. 

Invariant tests, 242, 256-7; and rank tests, 476, 
483, 492. 

Inverse sampling, 595. 

Inversions of order, 478. 

Irwin, J. O., truncated Poisson distribution, 
(Exercises 32.23-4) 535; outlying observations,
528; components of X², 577.

Iwaskiewicz, K., tables of power function of 
“Student’s” t-test, 255.


Jackson, J. E., bibliography of sequential 
analysis, 620. 

Jacobson, J. E., tables of Wilcoxon test, 494. 

James, G. S., tables for problem of two means, 
148. 

Jeffreys, Sir Harold, Bayesian intervals, 150, 
154.

Jenkins, W. L., nomogram for tetrachoric correlation,
307.

Jochems, D. B., two-stage LS, (Exercise 28.24) 
374. 

Jogdeo, K., tests using random normal deviates, 
487. 

Johns, M. V., Jr., efficient estimation from 
order-statistics, 524. 

Johnson, N. L., “close” estimators, 8; non-central
t-distribution, 255; non-central χ²
and Poisson distributions, (Exercise 24.19) 
261; probability integral transformation 
with parameters estimated, 443, (Exercise 
30.10) 463; bibliography of order-statistics,
513; SPR tests, 603, (Exercise 34.5) 621; 
review of sequential analysis, 620. 

Jonckheere, A. R., k-sample test, 505. 

Jowett, G. H., estimation of functional relation, 
405. 


k-sample tests, (Examples 24.4, 24.6) 234, 244;
237-9; (Exercises 24.4-7) 258; (Exercises
24.11-13) 259-60; 503-6.

Kabe, D. G., adjustment of regression analysis, 
356. 

Kac, M., comparison of X² and Kolmogorov
tests of fit, 460; tests of normality, 461. 

Kale, B. K., iterative ML estimation, 49, 60. 

Kasten, E. L., charts for test of independence 
in 2 x 2 tables, 553. 

Kastenbaum, M. A., hypotheses in multi-way 
tables, 583. 

Katti, S. K., estimation in negative binomial 
and Neyman Type A distributions, (Exer- 
cise 18.28) 73. 

Katz, L., power of test in 2 x 2 tables, 555. 

Katz, M., Jr., minimal sufficiency, 194 footnote. 

Kavruck, S., tetrachoric correlation, 307. 

Kemperman, J. H. B., tolerance regions, 521. 

Kempthorne, O., Tang’s tables, 253. 

Kendall, M. G., geometrical interpretation of 
LS theory, 80; estimation of a function of 
normal correlation parameter, 294; linear 
regression with identical observation errors, 
415, (Exercise 29.13) 418; distributions of 
rank correlations, 477-81; ties in rank 
correlation, 509; rank measure of associa- 
tion, 564-5. 

Keynes, J. M., characterizations of distributions
by forms of ML estimators, (Exercises
18.2-3) 67-8.

Kiefer, J., bounds for estimation variance, 16-17;
non-existence of a ML estimator,
(Exercise 18.34) 74; consistency of ML
estimators, 385, 388; comparison of X²
and Kolmogorov tests of fit, 460; tests of 
normality, 461; distribution of Hoeffding’s 
test, 483. 

Kimball, A. W., computation of X² partitions,
577. 

Kimball, B. F., ML estimation in extreme-value
distribution, (Exercise 18.6) 68.

Klett, G. W., tables of confidence intervals for
a normal variance, 118; shortness of these
intervals, (Exercise 20.10) 131.

Klotz, J., distribution of normal scores test, 
498; test for scale-shift, 503; tests of 
symmetry, 508. 

Knight, W., inverse sampling, 595. 

Kolmogorov, A., test of fit, 452-61. 

Kolodzieczyk, S., general linear hypothesis, 
249; tables of power function of “Student’s”
t-test, 255.

Kondo, T., standard error of X² in r x c table,
560. 

Konijn, H. S., ARE of tests, 267; linear trans- 
formations of independent variates, 482. 

Koopman, B. O., distributions with sufficient 
statistics, 26, 28. 

Kramer, K. H., confidence intervals for multi- 
normal multiple correlation, 342. 

Krishnan, M., Type M unbiassed tests, 202; 
doubly non-central t, 255.

Kruskal, W. H., geometry of LS, 81; linear re- 
gressions, correlation coefficients and ratios, 
(Exercise 26.24) 316; history of rank cor- 
relation, 478; Wilcoxon test, 493-4, (Exer- 
cise 31.15) 511; k-sample test, 504; ties, 
509; association in categorized data, 545, 
561, 565-6, (Exercise 33.11) 586. 

Kudo, A., rejection of outliers, 529. 

Kulldorff, G., censored exponential samples, 
526. 

Kurtic curve, 347. 


Laha, R. G., linearity of regression, 416. 

Lancaster, H. O., continuity corrections when 
pooling X² values, 556; polychoric estimation,
561; canonical correlations and transformation
to bivariate normality, 568-9;
components of X², 575, (Example 33.13)
576, 577, (Exercises 33.15-18) 587-8; 
multi-way tables, (Example 33.14) 581, 
583, (Exercise 33.8) 585, (Exercises 33.27-8)
589-90.

Laplace transform, 191 footnote.

Latscha, R., tables of exact test of independence 
in 2 x 2 tables, 553, 555. 

Laurent, A. G., censored exponential samples, 
526. 

Lawley, D. N., approximations to distribution 
of LR statistic, 233. 

Lawton, W. H., problem of two means, 147. 

Least Squares (LS) estimators, 75-91, (Exercises
19.1-8) 95-6; and ML equivalent in
normal case, 75, 80; LS principle, 75-6;
linear model, 76-87; unbiassed, 78; dis- 
persion matrix, 78; MV property, 79-80; 
geometrical interpretation, 80-1; mini- 
mization of generalized variance, 81- 
82; estimation of error variance, 82-3; 
irrelevance of normality assumption to 
estimation properties in linear model, 83; 
singular case, 84-7; ordered estimation of 
location and scale parameters, 87-91, 
(Exercises 19.10-12) 96-7; inefficient in 
Poisson case, (Exercise 19.16) 97; general 
linear hypothesis, 250; approximate linear 
regression, 286-7, 325-6; in linear regres- 
sion model, 354-70; use of supplementary 
information, (Exercises 28.17-18) 372; 
adjustment for an extra observation, (Exer- 
cise 28.19) 373; two-stage, (Exercise 28.24) 
374; in r x c tables, 568; see Linear model,
Regression. 

Lecam, L., superefficiency, 44. 

Legendre polynomials, 444 footnote. 

Lehmann, E. L., example of absurd unbiassed 
estimator, 5, (Exercise 17.26) 34; testing 
hypotheses, 161; optimum test property 
of sufficient statistics, 187; completeness, 
190; completeness of linear exponential 
family, 191; minimal sufficiency, 194; 
completeness and similar regions, 196; 
problem of two means, (Example 23.10) 
199; non-similar tests of composite H₀,
200; unbiassed tests, 206; UMPU tests 
for the exponential family, 207; non- 
completeness of Cauchy family, (Exercise 
23.5) 219; minimal sufficiency for binomial 
and rectangular distributions, (Exercises 
23.11-12) 220; a UMPU test for normal 
distribution, (Exercise 23.16) 220; UMPU 
tests for Poisson and binomial distribu-
tions, (Exercises 23.21-2) 222; UMP tests
for exponential and rectangular distribu-
tions, (Exercises 23.25-6) 222-3; a useless
LR test, (Example 24.7) 246; optimum 
properties of LR tests, 255; asymptotically
UMP tests for a normal mean, (Example 
25.1) 263; efficiency of tests, 265; UMP 
invariant tests for correlation and multiple 
correlation coefficients, 295, 342; con- 
fidence intervals in regression, (Exercise 
28.20) 373; distribution of X² test of
fit, 428, 430; completeness of order- 
statistics, 472 footnote; unbiassed rank 
tests of independence, 483; rank tests, 487; 
distribution of Wilcoxon test, 494; con- 
fidence intervals based on Wilcoxon tests, 
494, (Exercises 31.24-5) 512; consistency
of Wilcoxon test, 495; ARE of Wilcoxon 
test, 497; power of Wilcoxon test, 498; 
robust estimation, 502; unbiassed two- 
sample test, 503; tests of symmetry, 507; 
Wilcoxon and normal scores tests ARE, 
(Exercise 31.23) 512; optimum proper- 
ties of Sign test, 514; minimum ARE of 
Sign test, (Exercise 32.1) 531; Sign test 
asymptotically most powerful location 
test for double exponential, (Exercise 
32.5) 531; sequential completeness, 614 
footnote. 

Lehmer, E., tables of power function for linear 
hypothesis, 253. 

Lewis, B. N., multi-way tables, 580. 

Lewis, P. A. W., tables of test of fit, 452. 

Lewis, T. O., singular LS estimation, 86.

Lewontin, R. C., ML estimation of number of 
classes in a multinomial, (Exercise 18.10) 
69. 

LF, see Likelihood function. 

Lieberman, G. J., tables of non-central t-dis- 
tribution, 254. 

Life testing, 526. 

Likelihood equation, 36; see Maximum Likeli- 
hood. 

Likelihood function (LF), 8; use of, 62; see
Maximum Likelihood. 

Likelihood Principle, 217. 

Likelihood Ratio (LR) tests, normal distribu- 
tion, (Example 22.8) 174, generally, 224- 
247; and ML, 224-5, 241-2; not necessarily 
similar, 225-6; approximations to distri- 
butions, 227, 230-4; asymptotic distribu- 
tion, 230-1; asymptotic power and tables, 
231-2, 253-4; when range depends on 
parameter, 236-40, (Exercises 24.8-9) 
258-9; properties, 240-7; consistency, 231, 
240-1; unbiassedness, 240-2, 245-6, 
(Exercises 24.14-18) 260; other properties, 
245-6; a useless test, (Example 24.7) 246;
for linear hypotheses, 249-51; power func- 
tion, 253-4; optimum properties, 255-6; 
efficient in multinomial samples, 276; in 
normal correlation and regression, 299-
301, 338; of fit, 420-3, (Exercise 30.11)
463; of independence in 2 x 2 tables,
547-9; in r x c tables, 561.

Lindeberg, J. W., early work on rank corre- 
lation, 478. 

Lindley, D. V., grouping corrections to ML 
estimators, (Exercises 18.24-5) 72; tables 
of confidence intervals for a normal vari- 
ance, 118; concordance of fiducial and 
Bayesian inference, 155-7, (Exercise 21.11)
160; choice of test size, 183; conditional 
tests, 218; inconsistency of ML estimators 
in functional relation problem, 385, 387; 
controlled variables, 408; observational 
errors and linearity of regressions, 414, 
416, (Exercises 29.12, 29.14) 418. 

Linear model, 76-87; tests requiring normality 
assumption, 83-4; general case, 87, (Exer- 
cises 19.2-3, 19.5, 19.9) 96; general 
linear hypothesis, 247; canonical form, 
248; LR statistic in, 249; and LS theory, 
250; power function of LR test, 253-4; 
optimum properties of LR test, 255-6; and 
regression, 354-70; meaning of “linear,”
355-6; confidence intervals and tests for 
parameters, 362-5, (Exercises 28.15-22) 
372-4; confidence regions for a regression 
line, 365-70; supplementary information 
in regression, (Exercises 28.16-18) 372;
adjustment for an extra observation, 
(Exercise 28.19) 373; see Least Squares, 
Regression. 

Linear regression, see Regression. 

Linnik, Yu V., problem of two means, 200. 

Lipps, G. F., early work on rank correlation, 
478. 

Lloyd, E. H., ordered LS estimation of location 
and scale parameters, 87, (Exercise 19.10) 
96. 

Location, centre of, 64. 

Location and scale parameters, ML estimation 
of, 62-4; ML estimators uncorrelated for 
symmetrical parent, 63-4; LS estimation 
by order-statistics, 87-91, (Exercises 19.10- 
12) 96-7; ordered LS estimators un- 
correlated for symmetrical parent, 89-90; 
completeness, 190; no test for location 
with power independent of scale parameter, 
(Exercise 23.30) 223; unbiassed invariant
tests for, 242-5, (Exercise 24.15) 260; 
estimation of, in testing fit, 443, 452, 
(Exercise 30.10) 463. 

Location-shift alternative, 488. 

Locks, M. O., tables of non-central t, 255.

Logistic distribution, MVB, (Exercise 17.5) 32; 
ML estimation in, (Exercise 18.29) 73; 
ARE of Wilcoxon test, (Exercise 31.17) 
511; c.f. and cumulants of order-statistics, 
(Exercise 32.12) 532. 

Lognormal distribution, ML estimation in, 
(Exercises 18.7-9, 18.19-20, 18.23), 68- 
5 

LR, see Likelihood Ratio. 

LS, see Least Squares. 

Lyons, T. C., tetrachoric correlation, 307.


McCornack, R. L., tables of Wilcoxon sym- 
metry test, 508. 

McKay, A. T., distribution of extreme deviate 
from mean, (Exercise 32.21) 534. 

McKendrick, A. G., estimation in censored 
samples, (Exercise 32.24) 535. 

MacKinnon, W. J., tables of Sign test and con- 
fidence intervals for median, 514, 518. 

MacStewart, W., power of Sign test, 515. 

Madansky, A., shortest confidence intervals, 
117; functional and structural relations, 
383, 396, 405, (Exercise 29.16) 418; toler- 
ance intervals, 521; multi-way tables, 
(Exercises 33.24, 33.30) 589, 591. 

Maitra, A. P., sufficiency, 28. 

Mann, H. B., Tang’s tables, 253; choice of 
classes for X² test of fit, 432, 438-9; un-
biassedness of X² test, 434; variance of
X², (Exercise 30.3) 462; rank correlation
test for trend, 483-4, (Exercise 31.8) 510; 
Wilcoxon test, 493. 

Mantel, N., confidence intervals for non- 
linear functions, 128. 

Maritz, J. S., biserial correlation, 311. 

Marsaglia, G., quadratic forms in non-central 
normal variates, 229; multinormal regres- 
sion, (Exercise 27.21) 345. 

Massey, F. J., Jr., tables of Kolmogorov’s test
of fit, 457; power of Kolmogorov test, 
459-60, (Exercise 30.16) 464. 

Mauldon, J. G., fiducial paradox, 155. 

Maximum Likelihood (ML) estimators, in 
general, 35-74; ML principle, 35; and 
MVB estimation, 36-7, (Exercise 18.32) 74; 
and sufficiency, 36, 37, (Example 18.5) 43, 
52; uniqueness in presence of sufficient 
statistics, 36-7, 52-3; not generally un- 
biassed, (Example 18.1) 38, 42, (Example 
18.11) 53; large-sample optimum proper- 
ties, 38; consistency and inconsistency, 
39-42, 55, (Example 18.16) 61, (Exercise 
18.35) 74, 385, 387-8; cases of indeter-
minacy, (Example 18.4) 42, (Exercises
18.17, 18.23, 18.33-5) 70-4; efficiency and
asymptotic normality, 42-6, 55-6; asymp- 
totic variance equal to MVB, 43-4; simpli- 
fication of asymptotic variance and dis- 
persion matrix in presence of sufficiency, 
46, 55; cumulants of, 46-8; successive 
approximations to, 48-51; of several 
parameters, 52-60; asymptotic minimiza- 
tion of generalized variance, 55-6; non- 
identical parents, 60-1; of location and 
scale parameters, 62-5; characterization of
parents having ML estimators of given 
form, (Exercises 18.2-3) 67-8; of para-
meters restricted to integer values, (Exer-
cises 18.21-2) 71; corrections for grouping,
(Exercises 18.24-5) 71-2; and LS equiva-
lent in normal case, 75, 80; in structural
and functional relation, 379-85, 410-11;
choice of, in testing fit, 428-30; for trun- 
cation and censoring problems, 523-4; 
in sequential tests, (Exercise 34.21) 624. 

MCS, see Minimum Chi-Square. 

Mean-square-error, estimation by minimizing, 
21-2, (Exercise 17.16) 33. 

Medial correlation, (Exercise 31.7) 510. 

Median, Sign test for, 514-17, (Exercise 32.1)
531; confidence intervals for, 518, (Exer-
cise 32.4) 531.

Mendelian pea data, (Example 30.1) 422.

Mendenhall, W., censored exponential samples,
526; bibliography of life testing, 526.

Merrington, M., power functions of tests of
independence in 2 x 2 tables, 555.

Method of moments, see Moments.

Mickey, M. R., problem of two means, 147.

Mikulski, P. W., efficiency of rank-order tests,
502.

Miller, L. H., tables of Kolmogorov test of fit,
457.

Miller, R. G., Jr., bias-corrected estimation, 6.

Minimal sufficiency, see Sufficiency.

Minimum Chi-Square estimators, 92-5; modi- 
fied, 93; asymptotically equivalent to ML 
estimators, (Exercise 19.13) 97; mean and
variance, (Exercise 19.14) 97. 

Minimum mean-square-error, see Mean-square- 
error. 

Minimum Variance (MV) estimators, 8-19; 
and MVB estimators, 12; unique, 17-18; 
and sufficient statistics, 25-6, (Exercise 
17.24) 34; among unbiassed linear com- 
binations, 79-80; and ML estimation, 
37-8, (Exercise 18.13) 70; uniqueness and 
completeness, 190; in truncation and cen- 
soring problems, 524. 

Minimum Variance Bound (MVB), 9; condi- 
tion for attainment, 10; for particular 
problems, (Examples 17.6-10) 10-11; 
MVB estimation and MV estimation, 12;
improvements to, 12-17; asymptotically 
attained for any function of the parameter, 
15-16; and sufficiency, 24; smaller than 
variance bound when several parameters 
unknown, (Exercise 17.20) 33; relaxation 
of regularity conditions for, (Exercises 
17.21-2) 33-4; analogues, (Exercise 17.23)
34; and ML estimation, 36-8, 43-4, 
(Exercise 18.4) 68; generalization, (Exer- 
cise 18.32) 74; asymptotic improvements 
in non-regular cases, (Exercises 32.8-11) 
531-2; in sequential sampling, 614-15. 

Mises, R. von, test of fit, 451. 

Mitra, S. K., truncated Poisson, (Exercise 32.20) 
534; models for the r x c table, 559; hypo- 
theses in multi-way tables, 582-3. 

Mixtures of distributions, MVB for propor- 
tions, (Exercise 17.2) 31. 

ML, see Maximum Likelihood. 

Moments, method of, efficiency, 65-7, (Exer- 
cises 18.9, 18.17, 18.20, 18.27-8) 69-73. 

Mood, A. M., two-sample tests, 503; median 
test of randomness, (Exercise 31.12) 511; 
critical values for Sign test, 514. 

Moore, A. H., censored normal samples, 525; 
censored lognormal samples, 527. 

Moore, P. G., estimation using best two order- 
statistics, (Exercise 32.14) 533. 

Moran, P. A. P., distribution of multiple cor- 
relation coefficient, 339. 

Moses, L. E., scale-shift tests, 503. 

Moshman, J., double sampling, 619. 

Multinomial distribution, successive approxi- 
mation to a ML estimator, (Example 
18.10) 50; ML estimation of number of 
classes when all probabilities equal, (Exer- 
cise 18.10) 69; efficiency of LR tests, 276; 
as basis of tests of fit, 420-30; tests of fit
on pea data, (Example 30.1) 422; homo- 
geneity test, 589. 

Multinormal distribution, variance of a quad- 
ratic form, (Exercise 19.3) 96; case where 
single sufficient statistic exists without 
UMP test, (Example 22.11) 178; single 
sufficient statistic for two parameters, 
(Example 23.3) 193; unbiassedness of 
LR tests of independence, 245; biserial 
methods, (Exercise 26.12) 314; partial 
correlation and regression, 315-22, 332-3, 
(Exercises 27.4, 27.6, 27.21) 343, 345; in- 
variance of independence under orthogo- 
nal transformation, 333, (Exercise 27.7) 
343; multiple correlation, 338-42; with 
intra-class correlation matrix, (Exercises 
27.17-18) 345; limiting value of X², (Exer-
cises 33.7-8) 584-5.

Multiple comparisons, in r x c tables, 561. 

Multiple correlation, 334-42; coefficient, 334-
335; geometrical interpretation, 335-6;
screening of variables, 336-7; conditional
sampling distribution in normal case, 337-
338, (Exercise 27.13) 344; unconditional
distribution in multinormal case, 338-42,
(Exercises 27.14-16) 344; unbiassed esti-
mation in multinormal case, 342; with un-
correlated regressors, (Exercise 27.20) 345.

Multi-way tables, 580-3. 

Murphy, R. B., charts for tolerance intervals, 520.

MV, see Minimum Variance. 

MVB, see Minimum Variance Bound.

Myers, M. H., sequential t-tests, 614. 


Nair, K. R., estimation of functional relation, 
404-5; tables of confidence intervals for 
median, 518; distribution of extreme devi- 
ate from sample mean, (Exercise 32.21) 
534; tables of studentized extreme deviate, 
529. 

Narain, R. D., unbiassed LR tests of independ- 
ence, 245. 

Negative binomial distribution, ML estima-
tion, (Exercises 18.26-7) 72-3; truncated,
527; sequential sampling for attributes, 
(Example 34.1) 593. 

Neyman, J., consistency of ML estimators, 61, 
384; BAN estimators, 91-3; confidence 
intervals, 98, 111 footnote; most selective 
intervals, 117-18; intervals for upper ter- 
minal of a rectangular distribution, (Exer- 
cises 20.3-4) 131; tests of hypotheses, 161; 
maximizing power, 165; testing simple H₀
against simple H₁, 166; BCR in tests for
normal parameters, (Example 22.8) 175; 
UMP tests and sufficient statistics, 177, 
(Exercise 22.11) 185; sufficiency and simi- 
lar regions, 189; bias in test for normal 
variance, (Example 23.12) 203; unbiassed 
tests, 206; sufficiency, similar regions and 
independence, (Exercises 23.3-4) 219; LR 
method, 224; tables of power function of 
t-test, 255; LR tests in k normal samples, 
(Exercises 24.4-6) 258; incidental and 
structural parameters, 383; consistent esti- 
mation of structural relation, 385, 400; 
consistency of X² test of fit, 434;
“smooth” test of fit, 444-6. See Type A.

Neyman-Pearson lemma, 166, 169; extension,
208-9. 

Noether, G. E., confidence intervals for ratio 
of binomial parameters, (Exercise 20.9) 
131; ARE, 265; conservativeness of dis- 
tribution-free procedures in discrete cases, 
457, 498. 

Non-central confidence intervals, 102. 

Non-central F-distribution, 252-4. 

Non-central t-distribution, 254-5.

Non-central χ² distribution, 227-31, 252,
(Exercises 24.1-3) 257, (Exercise 24.19)
261; and ARE, 274-5.

Non-parametric hypotheses, 162; and distribu- 
tion-free methods, 470. 

Normal distribution, estimation of mean, 2, 7; 
MVB for mean, (Example 17.6) 10; MVB 
for variance, (Example 17.10) 11; estima- 
tion efficiency of sample median, (Example 
17.12) 20; estimation efficiency of sample 
mean deviation, (Example 17.13) 21; 
sufficiency in estimating mean and vari- 
ance, (Example 17.17) 27; bounds for 
variance in estimating standard deviation, 
(Exercise 17.6) 32; MV unbiassed estima- 
tion of square of mean, (Exercise 17.7) 32; 
minimum mean-square-error estimation 
of variance, (Exercise 17.16) 33; estimation 
efficiency of mean difference, (Exercise
17.27) 34; ML estimator of mean un- 
biassed, (Example 18.2) 38; ML estimator 
of standard deviation, (Example 18.8) 46; 
ML estimation of mean and variance,
(Examples 18.11, 18.13) 54, 56; estimation 
of mean restricted to integer values, (Exer-
cises 18.21-3) 71; grouping corrections to
ML estimators of mean and variance,
(Exercise 18.25) 72; ML estimation of
common mean of populations with differ- 
ent variances, (Exercise 18.30) 73; ML 
estimation of mean functionally related to 
variance, (Exercise 18.31) 74; non-existence 
of a ML estimator, (Exercise 18.34) 74; 
LS and ML equivalent, 75, 80; confidence 
intervals for mean, (Examples 20.1, 20.3, 
20.5) 100, 106, 110; confidence intervals 
for variance, (Example 20.6) 113, (Exercise 
20.6) 131, (Exercise 20.10) 131, (Exercise 
23.18) 221; tables of confidence intervals 
for variance and for ratio of two variances, 
118; confidence intervals for mean with 
variance unknown, 123-5; confidence 
intervals for ratio of two means, (Example 
20.7) 125; confidence regions for mean and 
variance, (Exercise 20.5) 131; confidence 
intervals for ratio of two variances, (Exer- 
cise 20.7) 131, (Exercise 23.18) 222; fidu- 
cial intervals for mean, (Example 21.1) 
136; fiducial intervals for mean with 
variance unknown, 138-9; confidence and 
fiducial intervals for problem of two means, 
139-50, (Exercises 21.4-5, 21.9-10) 159-
160, (Example 23.10) 199, (Example 23.16)
215, (Example 24.2) 226; Bayesian inter-
vals, 150-2; testing simple H₀ for mean,
(Examples 22.1-3) 164-8, 172-3, (Ex-
amples 22.12-14) 180-4, (Exercise 22.12)
185, (Example 23.11) 201, 202, 212, 
(Examples 25.1-5) 263-74; testing normal 
against double exponential form, (Example 
22.5) 169; testing various hypotheses for 
mean and variance, (Example 22.8) 174; 
testing simple H₀ for variance, (Exercises
22.3, 22.5) 184, 212; non-existence of simi-
lar regions, (Example 23.1) 188; testing
composite H₀ for mean, (Example 23.7)
196, 206, (Example 23.14) 213, (Example 
23.17) 218, (Example 24.1) 225; testing 
composite H₀ for difference between two
means, variances equal, (Example 23.8) 
197, (Example 23.15) 214; testing com-
posite H₀ for variance, (Examples 23.12-
14) 202, 205, 212, (Exercise 23.13) 221,
(Examples 24.3, 24.5) 231, 241; testing
composite H₀ for linear functions of differ-
ing means, and for common variance,
(Example 23.15) 214, (Exercise 23.19) 
222; testing composite H₀ for weighted
sum of reciprocals of differing variances,
(Example 23.16) 214; testing composite
H₀ for variance-ratio, (Example 23.16) 214,
(Exercises 23.14, 23.17-18) 221-2; proofs 
of independence properties using com-
pleteness, (Exercises 23.8-9) 220; “pecul-
iar” UMPU tests, (Exercises 23.16, 23.21)
221-2; minimality and single sufficiency, 
(Exercise 23.31) 223; testing equality of 
several variances, (Examples 24.4, 24.6) 
234, 244, (Exercises 24.4, 24.7) 258; in 
general linear hypothesis, 248-56; tables 
of power function of t-test, 255; LR tests 
for k samples, (Exercises 24.4-6) 258; 
asymptotically most powerful tests for the 
mean, (Example 25.1) 263; ARE of 
sample median, (Examples 25.2, 25.4-5)
267, 271, 274; data on functional relation, 
(Examples 29.1-7) 388, 390, 397, 401, 
405-6, 408; failure of product-cumulant 
method in estimating functional relation, 
396-8, 411; testing normality, 419, 461, 
468; choice of classes for X² test, 431;
“smooth” test for mean, (Example 30.5)
447; robustness of tests, 465-8; use of 
normal scores in tests, 486-7; choice of 
two-sample test, 488, 491; efficiency of 
Wilcoxon and normal scores tests, 497-8, 
501; paired t-test, 507; truncation and 
censoring, 525-6; criteria for rejecting out- 
lying observations, 528-9; estimation using 
best two order-statistics, (Exercise 32.14) 
533; distribution of extreme deviate from 
sample mean, (Exercise 32.21) 534; 
sequential test of simple H₀ for mean,
(Examples 34.8, 34.10) 608, 610; sequen- 
tial tests for variance, (Example 34.9) 608, 
(Exercises 34.18-20) 624; sequential t-test,
612-14; sequential estimation of mean, 
(Example 34.12) 617; double sampling for 
the mean and variance, 618-19; distribu- 
tion of sample size in sequential sampling, 
(Exercises 34.12-13) 622; see also Bivariate
normal, Multinormal. 

Normal scores, 486-7; in c, two-sample test, 
498-503, (Exercise 31.23) 512. 

Norton, H. W., interaction in multi-way 
tables, 583. 

Nuisance parameter, 208; removal of, (Exercise 
30.18) 464. 

Null hypothesis, 613 footnote. 


OC, operating characteristic, 597-8.

Odell, P. L., singular LS estimation, 86.

Ogawa, J., censored exponential samples, 526.

Ogburn, W. F., data on crime in cities, 
(Example 27.2) 331. 

Olkin, I., unbiassed estimation of normal cor- 
relation parameter, 294-5; unbiassed esti- 
mation of multiple correlation coefficient, 
342. 

Operating characteristic, 597-8. 

Order-statistics, for Cauchy location, (Ex- 
ample 18.9) 50; in LS estimation of loca- 
tion and scale parameters, 87-91; and 
similar regions, 188; completeness, 472; 
use of normal scores, 486-7; asymptotic 
distribution of expected values, 487; Sign 
test for quantiles, 513-17; confidence 
intervals for quantiles, 517-18; tolerance 
intervals, 518-21; in point estimation, 
521-2; truncation and censoring, 522-7; 
outlying observations, 527-30; estimation 
of quantiles, (Exercise 32.13) 532. 

Ordered alternative, for k-sample test, 506. 

Ordered categorization, 536. 

Ordered r x c tables, 562-74. 

Orthogonal, regression analysis, 356; poly- 
nomials in regression, 356-62, (Exercise 
28.23) 374; Legendre polynomials, 444 
footnote. 

Outlying observations, 527-30, (Exercise 32.21) 
534. 

Owen, D. B., tables of power of “Student’s”
t-test, 255.


Pabst, M. R., efficiency of rank correlation test, 
482. 

Pachares, J., tables of confidence limits for the 
binomial parameter and normal variances, 
118; tables of the studentized range, 529. 

Page, E. S., sequential test in the exponential 
distribution, (Exercise 34.14) 622. 

Paired t-test, 507. 

Parameter, nuisance, 208. 

Parameter-free tests of fit, 443-4, 452, 461, 
(Exercise 30.10) 463. 

Parametric hypotheses, 162. 

Partial association, 541-5, 580-4. 

Partial correlation and regression, 317-33; 
partial correlation, 317; linear partial re- 
gression, 321; relations between quantities 
of different orders, 322-5, (Exercises 27.1-3,
27.5) 343; approximate linear partial re- 
gression, 325-6; estimation of population 
coefficients, 327; geometrical interpreta- 
tion, 327-9; computations, (Examples 
27.1-2) 329-32; sampling distributions,
332-3. 

Partitions of X², 449-50; in r x c tables, 574-8.

Patil, G. P., attainability of MVB, 15. 

Patnaik, P. B., non-central χ² and F distribu-
tions, 229, 231, 253-4, (Exercise 24.2) 257;
variance of X², (Exercise 30.5) 462; power
function of test of independence in 2 x 2 
tables, 555. 

Paulson, A. S., tests for outliers, 529. 

Paulson, E., LR tests for exponential distribu- 
tions, 245, (Exercises 24.16, 24.18) 260; 
rejection of outliers, 529. 

Pearson distributions, efficient estimation in, 
65-6, (Example 18.18) 66-7, (Exercise 
18.16) 70. 

Pearson, E. S., charts of confidence intervals 
for the binomial parameter, 118; tests of 
hypotheses, 161; maximizing power, 165; 
testing simple H₀ against simple H₁, 166;
BCR in tests for normal parameters, 
(Example 22.8) 175; UMP tests and suffi- 
cient statistics, 177; choice of test size, 
183; UMP test and sufficiency in exponen- 
tial distribution, (Exercise 22.11) 185; bias 
in test for normal variance, (Example 
23.12) 203; unbiassed tests, 206; LR 
method, 224; charts of power function for 
linear hypothesis, 253; LR tests in k nor-
mal samples, (Exercises 24.4-6) 258; tests
of fit, 445, 452, (Exercise 30.12) 463; studies
of robustness, 465; rejection of outlying 
observations, 528-9; tables of studentized 
range, 529; models for 2 x 2 tables, 550; 
power functions of tests in 2 x 2 tables, 555. 

Pearson, K., development of correlation theory, 
279; tetrachoric correlation, 306; biserial 
correlation, (Table 26.7) 307, 309, 310; 
X² test of fit, 420; estimation using best
two order-statistics, (Exercise 32.14) 533; 
coefficient of contingency, 557; standard 
error of X² in r x c tables, 560; multi-way
tables, 580, (Exercise 33.7) 584. 

Peers, H. W., Bayesian and confidence inter- 
vals, 157. 

Permutation tests, 472. 

Pillai, K. C. S., tables of studentized extreme
deviate, 529. 

Pinkham, R. S., moments of non-central t, 255.

Pitman, E. J. G., “close” estimation, 8; dis-
tributions possessing sufficient statistics,
26, 28, 29; confidence intervals for ratio of
variances in bivariate normal distribution,
(Exercise 20.19) 133; sufficiency of LR on
simple H₀, (Exercise 22.13) 185; unbiassed
invariant tests for location and scale para-
meters, 242, (Exercise 24.15) 260; ARE,
265; ARE of Sign test, (Exercise 25.2) 276,
516; ARE of Wilcoxon test, (Exercises
25.3-4) 277, 496, (Exercise 31.22) 512;
test of independence, 474; two-sample
test, 489-91; consistency of Wilcoxon test,
495, (Exercise 31.16) 511.

Plackett, R. L., on origins of LS theory, 79; 
LS theory in the singular case, 84, (Exer- 
cises 19.6-8) 96; general linear model, 
(Exercise 19.2) 96; continuity corrections, 
508, 556; simplified estimation in censored 
samples, 524; censored normal samples, 
525; truncated Poisson distribution, 526, 
(Exercise 32.20) 534; c.f. and cumulants of 
logistic order-statistics, (Exercise 32.12) 
532; estimation of quantiles, (Exercise 
32.13) 533; interaction in multi-way tables,
583. 

Point-biserial correlation, 311-12, (Exercise
26.5) 313.

Point estimation, 1-97.

Poisson distribution, MVB for parameter, 
(Example 17.8) 11; absurd unbiassed esti- 
mator in truncated case, (Exercise 17.27) 
34; MCS estimation, (Example 19.11) 93; 
in linear model, (Exercise 19.16) 97; con- 
fidence intervals, (Example 20.4) 106-8; 
tables of confidence intervals, 118; testing 
simple H₀, (Exercise 22.1) 184, 212;
UMPU tests for difference between two 
Poisson parameters, (Exercise 23.21) 222; 
and non-central χ² distribution, (Exercise
24.19) 261; truncation and censoring, 526, 
(Exercises 32.19-20, 32.22-4) 534-5; dis- 
persion test, 579-80; inverse sampling, 
595; sequential estimation, (Example 
34.13) 618. 

Polychoric estimation of normal correlation, 
561. 

Polynomial regression, 356-70.

Polytomy, 545. 

Potthoff, R. F., testing homogeneity, 579. 

Power, of a test, 164-5; function, 179; see Tests
of hypotheses. 

Pratt, J. W., expected length of confidence 
intervals, (Exercises 20.17-18) 133; un- 
biassed estimation of normal correlation 
parameter, 294-5; unbiassed estimation of
multinormal multiple correlation, 342;
robustness of two-sample tests, 502. 

Prediction intervals, 363. 

Press, S. J., confidence interval comparisons for 
problem of two means, 148. 

Price, R., non-central F distributions, 253. 

Probability integral transformation, in tests of 
fit, 443, 451, (Exercise 30.10) 463. 

Prout, T., see Lewontin, R. C. 

Przyborowski, J., confidence limits for the 
Poisson parameter, 118. 

Pseudo-inverse of singular matrix, (Exercise 
27.21) 345. 

Puri, M. L., k-sample tests, 504. 

Putter, J., treatment of ties, 509. 

Pyke, R., spacings, 513. 


Quadratic forms, mean and variance, (Exercises 
19.3, 19.9) 96; non-central χ² distribution
of, 229-30, (Exercise 24.10) 259. 

Quantiles, tests and confidence intervals for, 
513-18. 

Quenouille, M. H., corrections for bias, 5-7, 
(Exercises 17.17-18) 33; fiducial para- 
doxes, 155; random exponential deviates, 
(Example 30.2) 432; partitions of X², 578.

Quesenberry, C. P., tests for outliers, 529. 


Raghavachari, M., scale-shift tests, 503. 

Raj, D., truncated and censored Gamma dis- 
tributions, 527. 

Ramachandramurty, P. V., robust estimation, 
502. 

Ramachandran, K. V., tables of confidence 
limits for normal variance and ratio, 118, 
204, (Exercise 23.17) 221. 

Randomness, tests of, 483-7. 

Ranks, as instrumental variables, 406-8, (Exer- 
cises 29.9-10) 417. 

Rank tests, 476; using rank correlation coeffi- 
cients, 476-86; for two- and k-sample 
problems, 492, 506; independence of sym- 
metric functions, (Exercise 32.2) 531. 

Rao, B. R., analogue of MVB, (Exercise 17.23) 
34; censoring, truncation and efficiency, 
524, (Exercise 32.26) 535. 

Rao, C. R., MVB, 9, (Exercise 17.20) 33; 
sufficiency and MV estimation, 25, 28; 
efficiency and correlation, 44; MCS esti- 
mation, 93; second-order estimating effi- 
ciency, 95. 

Ratio, confidence intervals for, see particular 
distributions. 

Rectangular distribution, sufficiency of largest 
observation for upper terminal, (Example 
17.16) 24; sufficiency when both terminals 
depend on parameter, (Examples 17.18, 
17.21-3) 30-1; MV unbiassed estima- 
tion, (Exercise 17.24) 34; ML estimator of 
terminal biassed, (Example 18.1) 38; a ML 
estimator a function of only one of a pair 
of sufficient statistics, (Example 18.5) 42; 
ML estimation of both terminals, (Ex- 
ample 18.12) 54; non-unique ML estima- 
tor of location, (Exercise 18.17) 70; 
ordered LS estimation of location and 
scale, (Example 19.10) 90; confidence 
intervals for upper terminal, (Exercises 
20.2-4, 20.15) 131-2; UMP one-sided 
tests of simple H₀ for location parameter,
(Exercise 22.8) 184; minimal sufficiency, 
(Exercise 23.12) 220; power of conditional 
test, (Exercise 23.23) 222; UMP test of 
composite H₀ for location parameter,
(Exercise 23.26) 223; connexions with χ²
distribution, 236-7; LR test for location 
parameter, (Exercise 24.9) 258; LR test for 
scale parameter, (Exercise 24.16) 260; dis- 
tribution of order-statistics, (Exercise 
30.17) 464; ARE of Wilcoxon two-sample 
test, (Exercise 31.22) 512; asymptotic 
variance bound for mean, (Exercise 32.10) 
532; estimation from censored samples, 
(Exercise 32.25) 535. 

Regression, and dependence, 278-9; curves, 
281-2; and covariance, 283-4; linear, 
284-7; coefficients, 285; equations, 285; 
computation, (Examples 26.6-7) 289-92;
scatter diagram, 292; standard errors, 292- 
293; tests and independence tests, 296; 
linearity and correlation coefficient and 
ratios, 296-9; LR test of linearity, 299- 
301; generally, 346-74; analytical theory, 
346-54, (Exercises 28.1-5) 370; criteria for
linearity, 350-2; general linear model, 
354-70; adjustment, 356, (Exercise 28.20) 
373; segmented curves, 356; orthogonal, 
356-62, (Example 28.4) 364, (Exercise 
28.23) 374; confidence intervals and tests 
for parameters, 362-5, (Exercises 28.15-
22) 372-4; confidence region for a regres-
sion line, 365-70; tests of difference be-
tween coefficients, (Exercises 28.14-15)
371; use of supplementary information, 
(Exercises 28.17-18) 372; graphical fitting, 
(Exercise 28.19) 373; and functional re-
lations, 376-7, 378-9, 380, 387; and con-
trolled variables, 408-9; effect of observa-
tional errors, 414-16; see also Correlation,
Least Squares, Linear model, Multiple 
correlation, Partial correlation and regres- 
sion. 

Regressor, 346. 

Residuals, 322 footnote. 

Resnikoff, G. J., tables of non-central t-distri- 
bution, 254. 

Restricted parameters, ML estimation of, 
(Exercises 18.21-3) 71. 

Ricker, W. E., tables of confidence intervals for 
the Poisson parameter, 118. 

Rider, P. R., truncated distributions, 527. 

Robbins, H., variance bounds for estimators, 
17; estimation of normal standard devia- 
tion, (Exercise 17.6) 32; distribution of 
“Student’s” t when means of observa- 
tions differ, 467; tolerance intervals and 
order-statistics, 519; moments in sequen- 
tial estimation, 615. 

Robison, D. E., intersection of polynomial 
regressions, 365. 

Robson, D. S., estimation of terminal, (Exer- 
cise 17.13) 33; orthonormal polynomials, 
(Exercise 28.23) 374. 

Robustness, 465-9; in estimation, 469, 502. 

Romig, H. G., double sampling, 607. 

Rosenblatt, M., distribution of Hoeffding’s 
test, 483. 

Rosenthal, I., sampling experiments on G in 
ordered tables, 566. 

Rothenberg, T. J., Cauchy location estimation, 
50. 

Roy, A. R., choice of classes for X² test, 431. 

Roy, J., truncated Poisson, (Exercise 32.20) 
534. 

Roy, K. P., asymptotic distribution of LR 
statistic, 231. 

Roy, S. N., simultaneous confidence intervals, 
128; models for the r × c table, 559; hypo- 
theses in multi-way tables, 582-3. 

Runs test, (Exercises 30.7-9) 462-3, 502. 

Rushton, S., orthogonal polynomials, 360. 


Saleh, A. K. M. E., censored exponential 
samples, 526. 

Sampford, M. R., truncated negative binomial, 
SZ: 

Sample d.f., 450-1. 

Sampling variance, as criterion of estimation, 
7; see Minimum Variance estimators. 

Sankaran, M., variance bound, (Exercise 
17.23) 34. 

Sarhan, A. E., ordered LS estimation of location 
and scale parameters, (Exercises 19.11-12) 
96-7, (Exercise 32.18) 534; order-statistics, 
513; censored samples, 525, 526, 527, 
(Exercise 32.25) 535. 

Savage, I. R., bibliography of non-parametric 
statistics, 471; normal scores two-sample 
test, 498, 501; independence of symmetric 
and rank statistics, (Exercise 32.2) 531. 

Savage, L. J., sufficiency, 24. 

Saw, J. G., censored normal samples, 526. 

Scale parameters, see Location and scale para- 
meters. 

Scale-shift alternative, 503. 

Scatter diagram, 292. 

Scedastic curve, 347. 

Scheffé, H., problem of two means, 141-6, (Ex- 
ample 23.10) 199; linear function of χ² 
variates, (Exercise 21.8) 159; complete- 
ness, 190; completeness of the linear 
exponential family, 191; minimal suffi- 
ciency, 194; completeness and similar 
regions, 196; unbiassed tests, 206; UMPU 
tests for the exponential family, 207; non- 
completeness of Cauchy family, (Exercise 
23.5) 219; minimal sufficiency for bi- 
nomial and rectangular distributions, 
(Exercises 23.11-12) 220; peculiar UMPU 
test for normal distribution, (Exercise 
23.16) 220; unbiassed confidence interval 
for ratio of normal variances, (Exercise 
23.18) 222; UMPU tests for Poisson and 
binomial distributions, (Exercises 23.21-2) 
222; controlled variables, 409, 413; com- 
pleteness of order-statistics, 472 footnote; 
confidence intervals for quantiles, 518; 
tolerance regions, 521. 

Scheuer, E. M., tables of non-central t, 255. 

Schneiderman, M. A., sequential t-tests, 614. 

Scott, E. L., consistency of ML estimators, 61, 
385, 400; incidental and structural para- 
meters, 383. 

Seal, H. L., runs test as supplement to X² test 
of fit, 442, (Exercise 30.7) 462. 

Seelbinder, B. M., double sampling, 619. 

Segmented curves, fitting by LS, 356. 

Sequential methods, 592-624; for attributes, 
(Exercise 18.18) 70, 592-607; closed, open, 
and truncated schemes, 597; tests of hypo- 
theses, 597; OC, 597-8; ASN, 598-9; 
SPR tests, 599-611, (Exercises 34.4, 34.7- 
11) 621-2; stopping rules, rectifying in- 
spection and double sampling in applica- 
tions, 607; efficiency, 609-11; composite 
hypotheses, 611-14; sequential t-test, 
612-14; estimation, 614-18, (Exercises 
34.5-6) 621; double sampling, 607, 618-19, 
(Exercises 34.15-17) 623; distribution- 
free, 619-20. 

Shah, B. K., logistic tables, (Exercise 32.12) 
533. 

Shah, S. M., truncated binomials, 527. 

Shapiro, S. S., testing normality, 461. 

Shaw, G. B., on correlation and causality, 279. 

Shenton, L. R., cumulants of ML estimators, 
48; efficiency of method of moments, 
(Exercise 18.23) 73. 

Shorrock, R., attainability of MVB, 15. 

Shrivastava, M. P., estimation of functional 
relation, 404. 

Sichel, H. S., ML estimation in lognormal dis- 
tribution, (Exercises 18.7-9) 68-9. 

Siddiqui, M. M., censored exponential samples, 
526. 

Siegel, S., two-sample test against scale-shift, 
503. 

Sign test, ARE, (Exercises 25.1-2) 276; 513- 
517; sequential, 619. 

Significance level, 163 footnote. 

Significance tests, see Tests of hypotheses. 

Sillito, G. P., order-statistics, 528. 

Silverstone, H., MVB, 9; comparison of BAN 
estimators, 95. 

Similar regions, similar tests, 187-8; see Tests 
of hypotheses. 

Simple hypotheses, 161-85; see Hypotheses, 
Tests of hypotheses. 

Singh, R., completeness, 190. 

Singular matrix, pseudo-inverse of, (Exercise 
27.21) 345. 

Size of a test, 163; choice of, 182-4. 

Slakter, M. J., approximation of X², 440. 

Smirnov, N. V., test of fit, 451; tables of 
Kolmogorov test of fit, 457; one-sided test 
of fit, 457; two-sample test, 502. 

Smith, C. A. B., moments of X² in a r × c table, 
(Exercise 33.9) 585. 

Smith, J. H., ML estimation of cell prob- 
abilities in a r × c table, (Exercise 33.29) 
590. 

Smith, K., MCS estimation in the Poisson dis- 
tribution, (Example 19.11) 94. 

Smith, S. M., cumulants of a ML estimator, 46. 

Smith, W. L., truncation and sufficiency, 522. 

“Smooth” tests of fit, 444-50. 

Sobel, M., an independence property of the 
exponential distribution, (Exercise 23.10) 
220; censoring in exponential samples, 
526-7, (Exercise 32.17) 534. 

Soper, H. E., standard error of biserial cor- 
relation coefficient, 310-11. 

SPR tests, sequential probability ratio tests, 
599-611, (Exercises 34.4, 34.7-11, 34.21-2) 
621-4. 

Sprent, P., linear functional relations, 394. 

Sprott, D. A., fiducial and Bayesian methods, 
157. 

Spurgeon, R. A., tables of non-central t, 255. 

Statistical relationship, 278. 

Stein, C., estimation of restricted parameters, 
(Exercise 18.22) 71; non-similar tests of 
composite H₀, 200; a useless LR test, 
(Example 24.7) 246; invariance and suffi- 
ciency, 256; tests of symmetry, 507; SPR 
tests, 600; sequential completeness, 614 
footnote; double sampling, 618-19. 

Stephan, F. F., ML estimation of cell prob- 
abilities in a r × c table, (Exercise 33.29) 
590. 

Stephens, M. A., tests of fit, 452. 

Sterne, T. E., confidence intervals for a pro- 
portion, 118. 

Stevens, W. L., confidence intervals in the dis- 
continuous case, 119; fiducial intervals for 
binomial parameter, (Exercise 21.3) 159; 
runs test, (Exercise 30.8) 463. 

Stochastic convergence, 3; see Consistency in 
estimation. 

Structural parameters, 383. 

Structural relationship, see Functional and 
structural relations. 

Stuart, A., estimation of normal correlation 
parameter, (Exercise 18.12) 70; ARE and 
maximum power loss, 273; intra-class cor- 
related multinormal distribution, (Exer- 
cises 27.18-19) 345; critical region for X² 
test of fit, 422; joint distribution of rank 
correlations, 481; ARE of tests of random- 
ness, 487, (Exercises 31.10-12) 510-11; 
correlation between ranks and variate- 
values, (Exercises 31.13-14) 511; testing 
difference in strengths of association, 565; 
test of identical margins in r × r tables, 
(Exercise 33.24) 589. 

“Student” (W. S. Gosset), studentization, 
124; outlying observations, 528. 

Studentization, 123-5, (Example 20.7) 125-7. 

Studentized extreme deviate, 528. 

Studentized maximum absolute deviate, 529. 

Studentized range, 529. 

Subrahmaniam, K., truncated Poisson, (Exer- 
cise 32.20) 534. 

Sufficiency, generally, 22-31; definition, 22-3; 
factorization criterion, 23; and MVB, 24; 
functional relationship between sufficient 
statistics, 25; and MV estimation, 25-6; 
distributions possessing sufficient statis- 
tics, 26-7; for several parameters, 27-8; 
when range of parent depends on para- 
meter, 28-31, (Exercise 24.17) 260; single 
and joint, 27; distribution of sufficient 
statistics for exponential family, (Exercise 
17.14) 33, (Exercise 24.17) 260; and ML 
estimation, 36-7, (Example 18.5) 43, 52-3; 
and BCR for tests, 170, 177-80, (Exercises 
22.11, 22.13) 185; optimum test property 
of sufficient statistics, 187; and similar 
regions, 189-90, 195-6; minimal, 193-5, 
(Exercise 18.13) 70, (Exercise 23.31) 223; 
ancillary statistics and quasi-sufficiency, 
217; independence and completeness, 
(Exercises 23.6-7) 219; and LR tests, 245; 
and invariance, 256; and nuisance para- 
meters, (Exercise 30.18) 464; and trunca- 
tion, 522. 

Sufficient statistics, see Sufficiency. 

Sukhatme, P. V., LR tests for k exponential 
distributions, (Exercises 24.11-13) 259- 
260. 

Sundrum, R. M., estimation efficiency and 
power, 171, (Exercise 22.7) 184; power of 
Wilcoxon test, 498. 

Superefficiency, 44. 

Supplementary information, in regression, 
(Exercises 28.16-18) 372; instrumental 
variables, 398-408. 

Swamy, P. S., efficiency under truncation and 
censoring, 524-5. 

Swed, F. S., tables of runs test, (Exercise 30.8) 
463. 

Symmetrical distributions, ML estimators of 
location and scale parameters uncorrelated, 
63-4; ordered LS estimators of location 
and scale parameters uncorrelated, 89-90; 
condition for LS estimator of location 
parameter to be sample mean, (Exercise 
19.10) 96; ARE of Sign test, 515-16, 
(Exercise 32.1) 531; tests of symmetry, 
506-8. 

Symmetry tests, 506-8, (Exercise 31.25) 512. 


Taguti, G., tables of normal tolerance limits, 
130. 

Tamura, R., scale-shift tests, 503. 

Tang, P. C., non-central F distribution and 
tables of power function for linear hypo- 
thesis, 253-4; non-central χ² distribution, 
(Exercise 24.1) 257. 

Tanis, E. A., testing exponential distributions, 
(Exercise 24.13) 260. 

Tarter, M. E., logistic order-statistics, (Exer- 
cise 32.12) 533. 

Tate, R. F., MV unbiassed estimation, (Exer- 
cise 17.24) 34; tables of confidence inter- 
vals for a normal variance, 118; shortness 
of these intervals, (Exercise 20.10) 131; 
biserial and point-biserial correlation, 311- 
312, (Exercises 26.10-12) 313-14; trun- 
cated Poisson distribution, 526, (Exercise 
32.19) 534. 

Teicher, H., characterization by form of ML 
estimator, (Exercise 18.2) 67; moments in 
sequential estimation, 615. 

Terminal, estimation of, (Exercise 17.13) 33. 

Terpstra, T. J., k-sample test, 506. 

Terry, M. E., tests using normal scores, 487, 
498. 

Tests of fit, 419-64; LR and Pearson tests for 
simple H₀, 420-3; X² notation, 421 foot- 
note; composite H₀, 423-30; choice of 
classes for X² test, 430-3, 437-40; mo- 
ments of X² statistic, 433-4, (Exercises 
30.3, 30.5) 462; consistency and un- 
biassedness of X² test, 434-6; limiting 
power of X² test, 436-7; use of signs of 
deviations, 441-2, (Exercises 30.7-9) 462- 
463; other tests than X², 442-61; “smooth” 
tests, 444-50; connexion between X² and 
“smooth” tests, 449; components of X², 
449-50; tests based on sample d.f., 450- 
461; Smirnov test, 451-2; Kolmogorov test, 
452-61; comparison of X² and Kol- 
mogorov tests, 459-60; tests of normality, 
461. 

Tests of hypotheses, 161-277; and confidence 
intervals, 117-18, 206; size, 163, 182-4; 
power, 164-5, 179-80; BCR, 165; simple 
H₀ against simple H₁, 166; randomization 
in discontinuous case, 166; BCR and suffi- 
cient statistics, 170, 177-80; power and 
estimation efficiency, 171-2, (Exercise 
22.7) 184; simple H₀ against composite 
H₁, 172; UMP tests, 172; no UMP test 
generally against two-sided H₁, 173-4; 
UMP tests with more than one parameter, 
174-80, (Exercise 22.11) 185; one- and 
two-sided tests, 180-1; optimum property 
of sufficient statistics, 187; similar regions 
and tests, 187-90, 195-200, 205-6; exist- 
ence of similar regions, 188-9, (Exercises 
23.1-3) 219, (Exercise 23.29) 223; similar 
regions, sufficiency and bounded com- 
pleteness, 189-90, 196-200; most powerful 
similar regions, 196; non-similar tests of 
composite H₀, 200; bias in, 200; unbiassed 
tests and similar tests, 205; older termino- 
logy, 206; UMPU tests for the exponential 
family, 207-17; ancillary statistics and 
conditional tests, 217-19; LR tests, 224- 
247; unbiassed invariant tests for location 
and scale parameters, 242-5, (Exercise 
24.15) 260; general linear hypothesis, 249- 
256; comparison of, 262-77; sequential, 
597-614, 618-20; see also Asymptotic re- 
lative efficiency, Hypotheses, Likelihood 
Ratio tests. 

Tetrachoric correlation, 304-7; series, 306, 
(Exercises 26.5-6) 313; series and canoni- 
cal correlations, 570. 

Thatcher, A. R., binomial prediction, 157. 

Theil, H., estimation of functional relation, 
405, 406, 413, (Exercises 29.9-10) 417. 

Thompson, W. R., confidence intervals for 
median, 518; outlying observations, 529. 

Tienzo, B. P., studentized extreme deviate, 
529. 

Ties, and distribution-free tests, 508-9. 

Tiku, M. L., non-central χ² and F distribu- 
tions, 229, 253, 254; tests of fit, 452. 

Tilanus, C. B., Cauchy location estimation, 50. 

Tocher, K. D., optimum exact test for 2 × 2 
table, 554. 

Todhunter, I., Arbuthnot’s use of Sign test, 
513 footnote. 

Tokarska, B., tables of power function of 
“Student’s” t-test, 255. 

Tolerance intervals, for a normal distribution, 
128-30, (Exercises 20.20-2) 133; distribu- 
tion-free, 518-21, (Exercises 32.6-7) 
531-2. 

Tolerance regions, 521. 

Transformations, of functional relations to 
linearity, 413-14; to normality, 469. 

Trend, 483. 

Trickett, W. H., tables for problem of two 
means, 148. 

Trimmed estimators, 525, 529. 

Truncated distributions, 522-7; estimation of 
truncation point, (Exercise 17.13) 33. 

Truncated sequential schemes, 593. 

Tschuprow, A. A., coefficient of association, 557. 

Tukey, J. W., fiducial paradox, 155; two- 
sample test against scale-shift, 503; con- 
fidence intervals for quantiles, 518; toler- 
ance regions, 521; truncation and suffi- 
ciency, 522; outlying observations, 529. 

Two means problem, see Normal distribution. 

Two-sample tests, 487-503. 

2 × 2 tables, association in, 536-41, (Exercises 
33.1-2) 584; partial association, 541-5; 
probabilistic interpretation of measures, 
545-7; large-sample tests of independence, 
547-9; exact test of independence for 
different models, 549-55; continuity cor- 
rection, 555-6, (Exercise 33.5) 584; com- 
ponents of X² in 2 × 2 × 2 tables, 580, 
(Exercises 33.27-8) 590; test of identical 
margins in 2° tables, (Exercise 33.30) 590. 

Type A, A₁, B, B₁, C tests, 206. 

Type A contagious distribution, ML and 
moments estimators, (Exercise 18.28) 73. 

Type M unbiassed tests, 202. 

Type I, Type II censoring, 522-3. 

Type IV (Pearson) distribution, centre of loca- 
tion and efficiency of method of moments, 
(Exercise 18.16) 70. 


UMP, uniformly most powerful, 172. 

UMPU, uniformly most powerful unbiassed, 202. 

Unbiassed estimation, 4-5; absurd, (Exercise 
17.26) 34; see Bias in estimation. 

Unbiassed tests, 202; and similar tests, 205. 

Uniform distribution, see Rectangular distribu- 
tion. 

Uniformly most powerful (UMP), 172; see 
Tests of hypotheses. 

Uniformly most powerful unbiassed (UMPU), 
202; see Tests of hypotheses. 


Vail, R. W., two-stage LS, (Exercise 28.24) 374. 

Van der Vaart, H. R., Wilcoxon two-sample 
test, 495, 498. 

Van der Waerden, B. L., two-sample test, 499. 

Van Eeden, C., ARE and correlation, (Exercise 
25.9) 277; scale-shift tests, 503. 

Van Yzeren, J., estimation of functional rela- 
tion, 405. 

Variance, generalized, see Generalized variance. 

Variance, lower bounds to, see Minimum Vari- 
ance Bound. 

Verdooren, L. R., tables of Wilcoxon test, 494. 

Villegas, C., linear functional relations, 394. 

Votaw, D. F., Jr., truncation and censoring for 
the exponential distribution, 526, (Exer- 
cise 32.16) 533. 


Wald, A., consistency of ML estimators, 41; 
asymptotic normality of ML estimators, 
54; tolerance intervals for a normal dis- 
tribution, 128-30; problem of two means, 
148; asymptotic distribution of LR statis- 
tic, 231; test consistency, 240, 262; asymp- 
totic properties of LR tests, 246; optimum 
property of LR test of linear hypothesis, 
256; asymptotically most powerful tests, 
263; estimation of functional relation, 
400-1, (Exercises 29.4-5) 416; choice of 
classes for X² test of fit, 432, 438-9; un- 
biassedness of X² test of fit, 434; one- 
sided test of fit, 457; variance of X² 
statistic, (Exercise 30.3) 462; runs test, 
(Exercise 30.8) 463, 502; tolerance regions, 
521; sequential methods, 599-603, 609, 
611-12, (Exercises 34.3-4, 34.7, 34.11-13, 
34.18-19) 620-4. 

Walker, A. M., efficiency of estimators, 44. 

Wallace, D. L., asymptotic expansions in prob- 
lem of two means, 148. 

Wallace, T. D., two-stage LS, (Exercise 28.24) 
374. 

Wallis, W. A., tolerance limits, 130; Wilcoxon 
test, 493-4, (Exercise 31.15) 511; k-sample 
test, 504; treatment of ties, 509. 

Walsh, J. E., measure of test efficiency, 275; 
effect of intra-class correlation on “Stu- 
dent’s” t-test, (Exercise 27.17) 345; power 
of Sign test, 516; tolerance intervals, 521; 
censored normal samples, 526; outlying 
observations, 531. 

Watson, G. S., problem of two means, (Ex- 
ample 23.10) 200; non-completeness and 
similar regions, (Exercise 23.15) 221; dis- 
tribution of X² test of fit, 425, (Exercise 
30.2) 462; choice of classes for X² test of 
fit, 450; modified Smirnov test, 452. 

Weather and crops, Hooker’s data, (Example 
27.1) 229-31. 

Weissberg, A., tables of normal tolerance limits, 
130. 

Welch, B. L., problem of two means, 146-8; 
Bayesian and confidence intervals, 157; 
power of conditional tests, 217-18, (Exercise 
23.23) 222; non-central t-distribution, 254. 

Whitaker, L., data on deaths of women, 94. 

Whitcomb, M. G., unbiassed tests for a normal 
variance, 213. 

White, C., tables of Wilcoxon test, 494. 

Whitlock, J. H., estimation of terminal, (Exer- 
cise 17.13) 33. 

Whitney, D. R., Wilcoxon test, 493. 

Whittinghill, M., testing homogeneity, 579. 

Wicksell, S. D., analytical theory of regression, 
(Exercise 26.2) 312, 348, (Exercise 28.4) 
370; criterion for linearity of regression, 351. 

Widder, D. V., Laplace transform, 190 foot- 
note. 

Wiener, N., completeness for location para- 
meter, 190. 

Wijsman, R. A., invariance and sufficiency, 256, 
614. 

Wilcoxon, F., two-sample test, 493. 

Wilcoxon test, 493-8, 502, 503; ARE, (Exer- 
cises 25.3-4) 277, 495-8, (Exercises 31.14- 
19, 31.22-24) 511-12; for symmetry, 508, 
(Exercise 31.25) 512; extension to censored 
samples, 527; for a group of outlying 
observations, 530. 

Wilénski, H., confidence limits for the Poisson 
distribution, 118. 

Wilk, M. B., moments of non-central t, 255; 
testing normality, 461; censored Gamma 
samples, 527. 

Wilkinson, G. N., truncated binomial, 527. 

Wilks, S. S., shortest confidence intervals, 115; 
smallest confidence regions, 128; confi- 
dence intervals for upper terminals of 
rectangular distribution, (Exercise 20.2) 
131; asymptotic distribution of LR statis- 
tic, 231; review of literature of order- 
statistics, 513; tolerance intervals, 519, 521. 

Williams, C. A., Jr., X? test of fit, 439, 460. 

Williams, E. J., comparison of predictors, 
(Exercise 28.22) 374; scoring methods in 
r × c tables, 568; canonical analysis, (Exer- 
cise 33.13) 586. 

Winsorized estimators, 525, 529. 

Wise, M. E., approximations to X², 421. 

Wishart, J., non-central χ² and F distributions, 
229, 253, (Exercise 24.1) 257; moments of 
multiple correlation coefficient, 341, (Exer- 
cises 27.10-11) 344. 

Witting, H., ARE of Wilcoxon test, 498; ARE 
of Sign test, 517. 

Wolfowitz, J., non-existence of ML estimator, 
(Exercise 18.34) 74; tolerance intervals for 
a normal distribution, 118-20; test con- 
sistency, 240, 262; optimum properties of 
LR test for linear hypothesis, 255; con- 
sistency of ML estimators, 385, 388; one- 
sided test of fit, 457; comparison of X² 
and Kolmogorov tests, 460; tests of nor- 
mality, 461; runs test, (Exercise 30.8) 463, 
502; sequential estimation, 615, (Exercise 
34.6) 621. 

Women’s measurements and sizes, data from, 
(Tables 26.1-2) 280-1. 

Woodward, J., truncated normal distribution, 
525. 

Working, H., confidence region for a regression 
line, 365, 368-9, (Exercise 28.13) 371. 

Wormleighton, R., tolerance regions, 521. 


χ² distribution, see Gamma distribution, Non- 
central χ² distribution. 

X² test of fit, 421-44; partitions of, 449-50; in 
2 × 2 tables, 549, 555; in r × c tables, 556- 
561, (Exercise 33.9) 585; partitions in r × c 
tables, 574-8; dispersion tests, 578-80, 
(Exercises 33.20-2) 588-9; in multi-way 
tables, 580-3, (Exercises 33.7-8) 585; see 
Tests of fit. 


Yates, F., fiducial inference, 154; testing differ- 
ence between regressions, (Exercise 28.14) 
371; tests using normal scores, 487, 498; 
data on mal-occluded teeth, (Example 
33.5) 552; continuity correction in 2 × 2 
tables, 555; scoring in r × c tables, 568, 
(Exercise 33.12) 586. 

Yields of wheat and potatoes, data on, (Table 
26.3) 291. 

Young, A. W., standard error of X² in r × c 
tables, 560. 

Yule, G. U., development of correlation theory, 
279; partial correlation notation, 317; 
large-sample distributions of partial co- 
efficients, 333; relations between coeffi- 
cients of different orders, (Exercise 27.1) 
343; data on inoculation, (Example 33.1) 
538; coefficients of association and colliga- 
tion, 539; invariance principle for measures 
of association, 546-7. 


Zackrisson, U., distribution of “Student’s” t 
in mixed normal samples, 467. 

Zalokar, J., tables of studentized maximum 
absolute deviate, 529. 

Zeigler, R. K., double sampling, 619. 

Zyskind, G., two-stage LS, (Exercise 28.24) 
374. 


